From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training
We find theoretical connections between discrete-time and continuous-time training objectives for diffusion samplers and show their empirical implications for faster training.
Abstract
Reviews and Discussion
This paper explores the connection between continuous-time SDEs and their discretization, particularly focusing on the influence of the chosen timestep. It demonstrates that using non-uniformly discretized time with fewer steps can achieve similar performance during inference. Theoretical results are provided to support this approach.
Strengths
- The problem studied in this paper is well-motivated.
- This paper presents extensive results in both theory and experiments.
- The appendix provides a comprehensive complement to the main text.
Weaknesses
- The theoretical results in Section 3 primarily focus on the convergence of the Euler-Maruyama method. Specifically, they show that convergence is ensured as the maximal step size approaches zero. However, these results do not explain why non-uniform discretization would generally be superior to uniform discretization. The advantage of non-uniform discretization—one of the main contributions of this paper—is demonstrated only through experiments.
- As previously mentioned, there seems to be a gap between the theoretical and empirical sections of this paper. After reading the introduction, I expected to see concrete theoretical results that justify the use of non-uniform discretization. However, simply showing that convergence is guaranteed as the maximal step size approaches zero is unsurprising. The authors might consider adding more discussion on why uniform discretization is not always the optimal choice.
- It has been proven that the order of convergence is determined by the step size, and the Euler-Maruyama scheme with uniform discretization has been shown to achieve optimal performance in the general case (see 'Numerical Treatment of Stochastic Equations' by Rümelin, 1982). I wonder if the claim made in this paper contradicts that result.
I would be willing to increase my rating if the authors are able to address my concerns.
Questions
Please see the weaknesses.
Dear Reviewer ofV6,
We appreciate your constructive feedback on our paper.
On theory and experiments
We kindly direct you to the response to all reviewers for a detailed answer.
The theory and experiments serve somewhat complementary purposes: while the theory establishes the (perhaps unsurprising, but certainly not trivial) fact that training with varying time discretization is well-behaved in the limit, the experiments explore the implications of the fact that such training is theoretically justified. Our exploration of nonuniform discretization yielded interesting empirical results, but they are indeed not explained by the theoretical ones (nor by any theory in existing work that we are aware of).
We suspect the main reason that nonuniform time discretization is helpful is that with uniform discretization, the sampler overfits to the small, fixed set of inputs of the time variable to the neural network computing the drift. This is supported by the evidence that our proposed Equidistant discretization -- in which the distribution over time steps seen during training has nearly full support but all time increments except the first and last have equal length -- gives similar results to the Random discretization, in which increments have different sizes.
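To make the schemes concrete, below is a minimal sketch (not our exact implementation; the function names, the unit time horizon, and the use of NumPy are illustrative only) of how the Uniform, Random, and Equidistant time grids could be drawn at each training iteration:

```python
import numpy as np

def uniform_grid(n_steps, T=1.0):
    # Fixed grid: the drift network only ever sees the same n_steps + 1 time values.
    return np.linspace(0.0, T, n_steps + 1)

def random_grid(n_steps, T=1.0, rng=np.random):
    # Sorted uniform draws as interior points: increments have varying lengths, and
    # over the course of training the time inputs cover (0, T) with full support.
    interior = np.sort(rng.uniform(0.0, T, size=n_steps - 1))
    return np.concatenate(([0.0], interior, [T]))

def equidistant_grid(n_steps, T=1.0, rng=np.random):
    # Random offset, then equally spaced interior points: all increments except the
    # first and last have equal length, yet across iterations the time inputs still
    # cover (0, T) with nearly full support.
    dt = T / (n_steps - 1)
    offset = rng.uniform(0.0, dt)
    interior = offset + dt * np.arange(n_steps - 1)
    return np.concatenate(([0.0], interior, [T]))
```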
We are happy to include extended discussion of this in the paper.
On Euler-Maruyama numerics
There may be a small misunderstanding regarding Rümelin's result and its relevance.
First, please correct us if we are wrong, but Rümelin, in the paper you referenced, assumes uniform step size and proves the orders of convergence of various integrators (Euler-Maruyama, Heun, Runge-Kutta) under this assumption (see the bottom of p.605 here).
Second, such a result would concern the order of convergence of integration, which we are interested in when sampling a trained model. Indeed, in our experiments, we evaluate all models with a uniform time discretization. However, the choice of discretization is important during training, where (as shown by our results) a nonuniform discretization gives a better approximate learning signal than a uniform one. Because the gradient of the continuous-time divergence is not an Itô integral whose Euler-Maruyama approximation is the gradient of the corresponding divergence in the discretization, this does not contradict the mentioned result.
Thank you again for your comments. We are happy to answer any further questions you may have.
Thanks for your response. However, my primary concern regarding the gap between the theoretical analysis and experimental results remains unresolved. While I agree that uniform discretization is not always the optimal approach in practice and other methods such as adaptive mesh refinement are often more effective, I still have a concern regarding the foundation of the random discretization scheme. Specifically, its "randomness" leads to high performance variance, which may be difficult to control or quantify. I am not sure if this method can really perform robustly in practice.
I appreciate the authors' efforts to address these concerns and have adjusted my score to 6. However, my overall stance on this work is completely neutral.
Thank you for your response and for adjusting your score.
We believe that we can clarify your concern regarding the theoretical gap.
First, your concern seems to be about inference-time, not training-time, discretization. Thus, we just want to repeat that all our models are evaluated with uniform time discretization.
Now, let us discuss the choice of discretization scheme in each phase:
Uniform discretization at inference time: During inference, i.e., sampling, of all trained models, we use a uniform discretization. Thus, the variance in the performance of the Random and Equidistant training methods during inference is comparable to the variance of the Uniform method.
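For reference, the uniform-step inference we describe amounts to standard Euler-Maruyama integration; a minimal sketch follows (illustrative only: `drift` stands in for the trained drift network, `sigma` for the diffusion coefficient, and the unit time horizon and 100 steps mirror the setting in our experiments):

```python
import numpy as np

def euler_maruyama_sample(drift, sigma, x0, n_steps=100, T=1.0, rng=np.random):
    # Integrate dX_t = drift(X_t, t) dt + sigma(t) dW_t on a uniform grid of n_steps steps.
    ts = np.linspace(0.0, T, n_steps + 1)
    x = np.array(x0, dtype=float)
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t
        x = x + drift(x, t) * dt + sigma(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x
```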
Regarding the discretization error for SDE inference in general:
- We can leverage results in, e.g., De Bortoli et al., NeurIPS 2021 (Section 2.2) to bound the error between the data distribution and the learned distribution for a given step-size of the Euler-Maruyama scheme and approximation of the optimal policy.
- The dependence of the error on the discrepancy of the learned sampler from the optimal one is also studied in Lee et al., NeurIPS 2022 and Chen et al., ICLR 2023 (Section 3.1).
- Bounds specific to neural network function classes are also proved by Oko et al., ICML 2023.
However, in our case, the 100-step uniform integration we use -- following many past studies on the same set of target densities -- is already quite close to continuous-time integration for the distributions considered (Figure 5 is evidence of this fact). Summarizing, the inference-time discretization scheme is not the main source of error, nor is it the focus of our work.
Randomized discretization at training time: Our results show that the non-uniform discretization schemes applied during training give better performance with uniform-step-size inference. Our theoretical results show that the learning objective under any discretization scheme converges to the same continuous-time object as the maximal step size goes to zero. However, it does not tell us which discretization scheme for training is optimal given a fixed number of steps (cf. our original answer above).
The results suggest that the reason for the better performance of the non-uniform discretization schemes is simply one of generalization over the time values given as input to the drift model. If one trains with a uniform discretization whose step size differs from the one used for evaluation, there will be a generalization gap. However, with the non-uniform training schemes, time values that occur at evaluation time are also seen during training -- the generalization gap is thus smaller. Said otherwise, our results are compatible with the hypothesis that the time generalization error is more significant than the discretization approximation error at training time in the settings considered.
Does this clarify your concern?
Dear reviewer ofV6,
Thank you again for your response and your additional feedback. As the discussion period is coming to an end, we would like to ask whether our response above was able to clarify your remaining concern, or whether you would appreciate any further clarification.
I would like to thank the authors for their detailed response. I will discuss with the other reviewers and the AC to reach a final decision.
This paper discusses the relationship between continuous and discrete-time stochastic processes and their training. In particular, the main results give a series of propositions on how a discrete-time process can approximate a continuous time process. I have to say I had a hard time understanding the "big picture" of the authors' results.
Strengths
The paper appears to be mathematically rigorous and experiments appear to give credence to the authors' work.
Weaknesses
I found the paper very difficult to read. The notation is dense, not all of it appears to be defined, some is non-standard and unclear, and I found it a little tricky to understand exactly what the authors wanted to do. It may be that the authors have solved an interesting problem in a genuinely useful way, but that was unclear from the paper. All but the very expert reader would, in my view, find the paper a difficult read.
A few specific comments are:
- Abstract could be more informative and precise
- Introduction is quite meandering and I wasn't quite clear on exactly what the authors were trying to do.
- Figures 1 and 2 were placed, in my view, quite early in the paper and were hard to interpret. They needed more textual description, or, considering where they were placed, needed "dumbing down" a little.
- Equation (1) is somewhat standard but, for completeness, it would have been useful to know what \sigma(t) is (I could guess). Equation (1) is similar to (9) apart from \mu(t). I think the differences between the various forms of \mu(t) needs to be explained in more detail.
- It wasn't clear to me exactly what the reverse arrow meant in terms of policy e.g. the backwards arrow is used on \pi(t) below equation (3) but without any definition as far as I can see.
- I found Section 2 quite muddled, with various different concepts introduced without much explanation. I realise there is a page limit, but it was bordering on the impenetrable.
- I didn't really understand how the Propositions in Section 3 ended up affecting the Results in Section 4. Perhaps I am dense, but it would be good if the authors could explain this better.
Questions
My main question is this: for a ML practitioner, how will the authors' results help?
Dear Reviewer hyjK,
Thank you for your review and your valuable feedback. Let us comment on your concerns in the following.
Presentation of the paper
While we put effort into making the paper as accessible as possible, we acknowledge that the presentation can be dense and requires certain background knowledge. Given the space constraints, we tried to strike a balance between providing necessary background information and citing the relevant literature for further details. However, the level of necessary detail naturally depends on the background of the reader. For instance, we also received feedback from another reviewer that we should shorten the background sections.
In general, we need to introduce several concepts (in a mathematically rigorous way), such as measures on trajectories, likelihood ratios, and objectives, in both discrete and continuous time, since they are required by our novel theory. While we try to offer intuition, such concepts rely on knowledge in (numerical) stochastic analysis, GFlowNets, and optimal control, which is difficult to fully explain in the limited amount of space. However, we purposefully start our presentation in discrete time, since this allows the reader to understand the concepts without background in stochastic calculus.
As you suggested, we polished our presentation and added several details (e.g., defining \sigma(t), unifying the notation for \mu(t), and explaining the backward policy \overleftarrow{\pi}), and (slightly) simplified Figure 2. Moreover, we added a paragraph explaining the connection between our theoretical results and the experiments at the end of Section 3.
Given that a main contribution of our paper is of theoretical nature (a rigorous foundation for the training of diffusion-based samplers), we hope that you understand that we need to assume certain theoretical background knowledge. Please let us know if you have further suggestions on improving the accessibility of our paper.
Practical implications of our work
On a high level, we show for the first time that different discretizations of training objectives approximate the same, unique continuous-time objective (in the limit of refining the discretization). Our results have a simple but profound implication:
One can use different discretizations for training and inference of diffusion-based samplers without incurring bias.
This finding is particularly important given that diffusion-based methods for sampling problems rely on an SDE discretization during training, incurring significant computational costs. Motivated by our theoretical results, we show for the first time that a few, randomized steps during training can achieve similar performance at a fraction of the cost. Our theoretical results (covering local as well as global objectives) hold for virtually all existing diffusion-based samplers, and we also obtain consistent empirical outcomes across methods and tasks.
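To make the practical takeaway concrete, a schematic training/inference loop could look as follows (illustrative pseudo-implementation; `loss_on_grid`, `step_optimizer`, and `sample_on_grid` are hypothetical stand-ins for the discretized training objective, the optimizer update, and the inference-time SDE integrator, respectively):

```python
import numpy as np

def train_then_sample(loss_on_grid, step_optimizer, sample_on_grid,
                      n_iters=10_000, train_steps=10, infer_steps=100, rng=np.random):
    for _ in range(n_iters):
        # Training: a fresh, few-step, non-uniform time grid per iteration (cheap simulation).
        interior = np.sort(rng.uniform(0.0, 1.0, size=train_steps - 1))
        grid = np.concatenate(([0.0], interior, [1.0]))
        step_optimizer(loss_on_grid(grid))
    # Inference: a fine, uniform grid, the same for all training variants.
    return sample_on_grid(np.linspace(0.0, 1.0, infer_steps + 1))
```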
We note that our findings are specific to sampling problems. In generative modeling, where we have access to samples from the target distribution, one leverages score-matching objectives that allow training in continuous time (without SDE discretizations), i.e., directly optimizing the continuous-time ELBO. We present further explanations in our general response and hope that this helps to understand the significance of our work to ML practitioners.
Thank you again for your comments. We are happy to answer any further questions you may have.
Thanks for the response and for the changes made. I can appreciate it is difficult to make the paper entirely self-contained given the page limits. I still think the paper is difficult to read though and, to be honest, I understood a lot more about the paper from reading the other reviewers' critique. I am not questioning the validity of your results and the right audience will appreciate your paper.
Dear Reviewer hyjK,
We want to thank you for your response and additional feedback. We have already revised the paper based on the discussion to make it more approachable for a wider audience and welcome any further suggestions.
Moreover, thank you for acknowledging the validity of our results and that there is an audience that would appreciate our paper. May we ask you to consider raising your score and helping our paper be noticed by the interested audience?
The authors
Thanks for the response, but I still think the paper is a difficult read and in particular I personally find the boundary between existing results and new ones a little hazy. I realize some of this is down to page limits but this is true for all authors.
This paper examines training neural stochastic differential equations (SDEs) to sample from Boltzmann distributions without target samples. This work derives asymptotic equivalences by linking discrete-time policies to continuous-time diffusion. The approach is validated on sampling benchmarks.
Strengths
- The approach of linking discrete-time policy objectives with continuous-time SDE training is a useful idea, albeit heavily reliant on established results.
- Authors show that this method potentially reduces computational costs for neural SDE training.
Weaknesses
- Firstly, I think the presentation of this work remains a major bottleneck for readers. Section 2 is preliminary, and it spans from pages 3 to 7. Such a lengthy preliminary section introduces well-known equations and results (e.g., equations (4)-(6) from GFlowNet papers, (9)-(15) from stochastic control and diffusion models, and (16), (17) as standard Euler-Maruyama discretizations). These derivations, mostly grounded in existing work, dilute the contributions and add an undue burden for readers. Figures like Figure 3, which illustrate obvious points, seem unnecessary and further contribute to this issue. It is recommended to present additional informative and easy-to-follow diagrams in these sections.
- The primary theoretical contribution—showing asymptotic convergence from Euler-Maruyama discretization to continuous-time SDEs (Propositions 3.2, 3.3, 3.4)—seems not surprising. The convergence results are probably straightforward applications of established SDE theory, with little added insights or unique techniques. Without further exploration of new derivation techniques or distinctive theoretical angles, the contributions feel like direct applications of existing results.
- The experiments are conducted on standard synthetic benchmarks, such as Gaussian mixtures and low-dimensional toy distributions. To support this approach, it might be necessary to conduct higher-dimensional Bayesian inference tasks where the Boltzmann distribution is more intractable. Besides, the compared baselines exclude many recent models, such as flow-based generative models.
- While efficiency is demonstrated, additional benchmarks comparing computational costs with traditional methods in larger dimensions would be helpful for real-world applications.
Questions
- Could the authors clarify why so much space is devoted to standard results? Would simplifying or condensing this content help highlight the unique contributions?
- Beyond applying existing convergence results, what novel techniques, if any, were introduced in proving Propositions 3.2, 3.3, and 3.4?
- Would more complex or realistic benchmarks alter the experimental outcomes, particularly in high-dimensional or non-Markovian sampling settings?
Dear Reviewer K8pm,
Thank you for your extensive and constructive feedback. We will comment on your concerns and questions in the following:
On the presentation of our work
Thank you for providing suggestions on the presentation of our paper. Generally, we believe that there is no prior work that connects these "preliminary" results (many of which only appeared in the last two years), and we hope that Figures 1 and 2 can serve as informative diagrams for navigating the paper. The motivation behind the current presentation is to make our theoretical results accessible to as many readers as possible:
- We start with the discrete-time approach since it requires less background knowledge in stochastic calculus. We agree that these are well-known results from GFlowNet papers; however, we think that many readers might not be familiar with this theory. For instance, GFlowNets themselves were only introduced in 2021 (by Bengio et al.), and the theory for continuous state spaces was developed only in 2023 (by Lahlou et al.). Similarly, the off-policy losses were only introduced to diffusion samplers in 2023 or 2024. However, based on your suggestion, we tried to shorten the exposition and moved Figure 3 to the Appendix.
- In the section on the continuous-time setting, we need to introduce concepts such as Nelson's identity and the Fokker-Planck equation since our results are based on them. Moreover, we want to emphasize that the Radon-Nikodym derivative between general forward and backward SDEs, as well as the resulting KL divergence are very recent results that appeared only this year (by [Vargas et al., 2024] and [Richter & Berner, 2024]). Thus we would not call them well-known results.
Please let us know if this clarifies our exposition. Otherwise, we are happy to restructure our paper further to facilitate readability.
Theoretical claims and contribution
We would like to refer you to the general response for a detailed exposition of our contributions. In particular, we detail why our theoretical results provide substantial new insights and are not just straightforward applications of existing theory.
More challenging tasks
We want to emphasize that our tasks include many of the arguably most challenging tasks on which samplers are typically benchmarked (see, e.g., the recent large-scale benchmark by Blessing et al., 2024). In particular, our tasks cover high-dimensional problems in Bayesian statistics, highly multi-modal problems (with a large number of modes) related to models in molecular dynamics, as well as complex posterior distributions from variational auto-encoders trained on image distributions.
Flow-based generative models as baselines
This is an interesting question.
First, we would like to point out that the main purpose of our experiments is to illustrate our theory and show the substantial benefit of using randomized discretizations during training. Our considered samplers have already been benchmarked against other baselines in the respective papers and we show that we can even further improve their performance.
Second, and more importantly, flow-based generative models are not directly applicable in our setting for two reasons:
- They train ODEs (i.e., deterministic dynamics), while we consider SDEs;
- We only have access to the (unnormalized) density of our target, but no samples. Thus, we cannot construct interpolants to train with the flow matching objective.
That said, frameworks that use approximate samples from a target distribution to train flow matching models have indeed been proposed. Two such methods were suggested in Tong et al., 2024 for flow matching: training on samples obtained by an MCMC on the target density and on importance-weighted samples from the prior density. More sophisticated approaches, using MCMC guided by the learned ODE, were proposed in very recent work such as Cabezas et al., 2024.
We performed an experiment on the 10-dimensional Funnel density, replicating the settings of [Tong et al, 2024]. We consider the importance-weighted method as well as an MCMC fit to 15k (50x the batch size) samples from the target. For flow matching, we consider both the algorithm as proposed by Lipman et al., 2023 -- equivalent to linear interpolants over a uniform coupling of noise and data -- and the optimal transport conditional flow matching (OT-CFM) introduced by Tong et al., 2024, which should yield straighter integration curves.
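For completeness, one simple way to realize the importance-weighted construction is sketched below (our own illustrative sketch under the stated assumptions, not code from Tong et al., 2024; `log_target` is assumed to evaluate the unnormalized target log-density on a batch, and the prior is a standard normal):

```python
import numpy as np

def importance_weighted_samples(log_target, dim, n_proposals=100_000, n_samples=15_000, seed=0):
    # Draw proposals from a standard-normal prior, form self-normalized importance
    # weights w proportional to target / prior, and resample a training set for flow matching.
    rng = np.random.default_rng(seed)
    proposals = rng.standard_normal((n_proposals, dim))
    log_prior = -0.5 * np.sum(proposals**2, axis=1) - 0.5 * dim * np.log(2 * np.pi)
    log_w = log_target(proposals) - log_prior
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(n_proposals, size=n_samples, replace=True, p=w)
    return proposals[idx]
```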
The results are shown in the table below. For the ODEs, we report both 100-step Euler integration (comparable to the 100-step Euler-Maruyama integration used for SDEs in our paper) and an adaptive higher-order solver. For ODEs, the KL divergence is estimated using the Hutchinson trace estimator to compute the sampling density. It should be noted that the metric for ODEs is the KL divergence between estimated and target marginal distributions, while the metric for SDEs is a KL divergence between trajectory distributions and thus only an upper bound on the former, meaning that the gap in performance (i.e., the true gap between marginal KLs) could be even greater than the table suggests.
| Method | KL divergence (100-step Euler[-Maruyama]) | KL divergence (Dormand-Prince, adaptive tolerance) |
|---|---|---|
| IW + FM | 1.51 ± 0.23 | 1.52 ± 0.33 |
| MCMC + FM | 4.59 ± 3.53 | 1.94 ± 0.88 |
| IW + OT-CFM | 2.09 ± 0.92 | 1.46 ± 0.16 |
| MCMC + OT-CFM | 4.01 ± 3.57 | 0.83 ± 0.04 |
| TB (10-step Random training) | 0.76 ± 0.02 | |
| PIS (10-step Random training) | 0.72 ± 0.02 | |
| TB (100-step training) | 0.54 ± 0.02 | |
| PIS (100-step training) | 0.52 ± 0.02 | |
We see that the flow matching models, integrated with the same number of steps, struggle to approach the sampling performance of the SDEs trained using differentiable simulation (PIS) or off-policy RL (TB) objectives. In more complex sampling tasks where MCMCs are slow to converge and importance weights have higher variance, we would expect these differences to be amplified; on the other hand, RL objectives -- while less efficient during training -- are asymptotically unbiased in the limit of continuous time, as we prove in our paper.
Thank you again for your comments. We are happy to answer any further questions you may have.
Dear Reviewer K8pm,
We would like to thank you once again for your review and feedback, which we believe makes this paper stronger. We've responded to all the issues that you mentioned and added the additional experiments on flow-based generative models as baselines.
Since the discussion period is approaching its end, could you please let us know if these have satisfactorily addressed your concerns? Please, let us know if you have any additional questions. We look forward to hearing from you.
Thank you,
The authors
This paper investigates the training of diffusion samplers and neural stochastic differential equations (neural SDEs) by examining the connection between continuous-time objectives and their discrete-time counterparts. The authors establish that global objectives for discrete-time policies converge to path-space measure divergence objectives in the continuous-time limit, while local constraints asymptotically align with partial differential equations governing the time evolution of marginal densities. This theoretical grounding aims to bridge reinforcement learning (RL) objectives and stochastic control frameworks for diffusion processes. Empirically, the paper demonstrates that training with coarse, non-uniform time steps, particularly with random placements, can achieve substantial computational efficiency gains while retaining strong performance across a range of benchmarks.
Strengths
- The paper is very well written and easy to follow, with clear exposition of the mathematical derivations and the empirical results.
- The experimental section is thorough and well designed, exploring the effects of different discretization strategies and their impact on performance in detail. The benchmarks used are diverse and represent a wide range of sampling challenges.
- The work provides strong empirical evidence that non-uniform time discretization (particularly random placement) improves training efficiency. This observation could be highly relevant for practitioners working with high-dimensional diffusion models. Furthermore, the identification of random time discretization as a performant strategy is novel and supported by robust experimental evidence.
- The paper effectively summarizes existing methods and objectives for diffusion sampling, offering a clear context for the proposed contributions and situating them within the broader body of work on diffusion models and sampling techniques.
Weaknesses
- While the theoretical contributions are valuable and provide an interesting link between discrete-time and continuous-time objectives, they are not completely unexpected and partly already present in the literature.
- In the experimental results, it is noted that the ELBO gap does not converge to zero as the discretization becomes finer but instead appears to stabilize at a positive value. The authors do not give an explanation for this phenomenon. In particular, the lack of a "benchmark" makes it difficult to connect these simulations to the numerical results presented in the first part of the paper.
- The observed performance gains with randomly placed time steps are well supported by empirical results, but the paper does not provide a theoretical explanation for why this approach works so well. Offering more insight into this phenomenon would enhance the overall impact of the findings.
Questions
- Is it correct to expect that the ELBO gap should converge to zero as the discretization becomes finer, or are there inherent limitations in the approach that cause the gap to saturate at a positive value? Clarifying this could help contextualize the observed results better.
- Are there any existing benchmarks or prior work that provide a comparable measure of ELBO gap performance for optimally trained diffusion samplers? How do the proposed methods stack up in this context?
- Can the authors provide more insight into why random placement of time steps works so (unexpectedly) well? Is there an intuitive or theoretical rationale for this observed behavior?
- In Theorem 3.4, there seems to be a potential issue, as the same term appears twice in the statement. Could this be a mistake, or is there a specific reason behind this repetition? Clarification would be helpful.
Dear Reviewer gLPG,
Thank you for your extensive review and for appreciating our clear exposition as well as our thorough experiments. Let us answer your remaining questions and concerns in the following:
On our theoretical contributions
While perhaps not being completely unexpected, we emphasize that there is also not necessarily a reason to believe that training objectives evaluated at different discretizations converge to a unique continuous-time object. While the required background knowledge in Section 2 is known (as also referenced in our paper), our links between discrete-time and continuous-time objectives are not present in the literature. In fact, previous literature has overlooked potential issues of training in discrete time. We present further clarifications in our general response, also contrasting our results with the setting in generative modeling.
Practical implications and intuition
In the general response, we also elaborate on why our theoretical results are crucial for being able to train and evaluate at different discretizations (see also the new paragraph at the end of Section 3). Since our results guarantee that we are approximating the same continuous-time objective with different training discretizations, we were not too surprised by the strong performance using only a few time steps during training. In particular, our results also explain why the other randomized discretizations we consider (such as "equidistant") offer performance improvements similar to our "random" one; see Appendix D.1. Intuitively, one could argue that a fixed uniform discretization makes the model overfit by trying to counteract the discretization error incurred by the SDE integrator. Since the time step is fixed, the model does not learn the optimizer of the continuous-time objective and cannot generalize to other discretizations during inference.
On the ELBO gap
We note that the ELBO satisfies an identity relating it to the KL divergence between the forward and reverse discrete-time path measures (spelled out below), which we have now made clearer in our metrics in Appendix D.1. In particular, this identity shows that the gap is exactly this KL divergence, which is only zero if the forward and reverse discrete-time processes are perfect time-reversals. In general, we cannot expect that our SGD-based training finds a global minimum. Beyond that (and independently of the expressivity of our neural networks), a perfect time-reversal with Gaussian transition kernels (as given by the Euler-Maruyama discretization) is not generally possible in discrete time (as also mentioned in our paper). While this has been ignored in previous work, our theory guarantees that training with different discretizations nevertheless approximates the same continuous-time objective.
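To spell out the identity (schematically and with notation simplified relative to the paper; $\vec{\mathbb{P}}$ and $\overleftarrow{\mathbb{P}}$ denote the forward and reverse discrete-time path measures, the latter weighted by the target, and $Z$ is the true normalizing constant):

```latex
\mathrm{ELBO} \;=\; \log Z - D_{\mathrm{KL}}\bigl(\vec{\mathbb{P}} \,\|\, \overleftarrow{\mathbb{P}}\bigr)
\qquad\Longrightarrow\qquad
\log Z - \mathrm{ELBO} \;=\; D_{\mathrm{KL}}\bigl(\vec{\mathbb{P}} \,\|\, \overleftarrow{\mathbb{P}}\bigr) \;\geq\; 0 .
```

The gap is therefore expected to stay strictly positive in discrete time, even for a perfectly expressive model.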
When the number of steps tends to infinity during inference, the gap is thus given by the continuous-time KL divergence (which only vanishes for a perfectly trained model). Nevertheless, the ELBO gap provides a principled metric, since lower values necessarily correspond to a smaller KL divergence. It is typically also the only metric considered for tasks where the ground-truth normalizing constant and samples from the target are not known; see, e.g., [Blessing et al., 2024]. For all other tasks, we additionally provide the error in estimating the normalizing constant (see Table 2 in the appendix).
Typo
Thank you for pointing out the typo in Theorem 3.4, which we have fixed in the revised version.
Thank you again for your comments. We are happy to answer any further questions you may have.
Dear Reviewer gLPG,
We would like to thank you once again for your review and feedback, which we believe makes this paper stronger. We've responded to all the issues that you mentioned, explaining the ELBO gap difference, clarifying our theoretical contributions, and showing the practical implications of our findings.
Since the discussion period is approaching its end, could you please let us know if these have satisfactorily addressed your concerns? Please, let us know if you have any additional questions. We look forward to hearing from you.
Thank you,
The authors
Thank you for the response. Unfortunately, I still have my reservations on the novelty and on the intuition resulting from the results being put forward. My score remains unchanged.
We thank all the reviewers for their comments. The suggestions have helped us improve the paper, and we have uploaded a revised version with the key changes highlighted in orange.
On the importance of the theoretical results and their relevance to the experiments
For diffusion models trained to maximize a variational bound on data log-likelihood, the denoising score-matching objective is equivalent to maximization of a continuous-time ELBO, equivalently, minimization of a KL divergence between reverse and forward path measures. The implications of this fact for the ability to train a diffusion model in one discretization (or with a continuous time parameter) and sample it in another are well understood starting from Song et al., 2021a and Huang et al., 2021.
Until our work, a similar result for diffusion samplers of unnormalized densities, in particular ones based on "off-policy" divergences such as VarGrad and the GFlowNet-inspired losses, has not been known. In fact, most past work on diffusion samplers has ignored the fact that exact time-reversal with Gaussian transition kernels is not generally possible in discrete time. Instead, a discrete-time objective that cannot, in theory, be taken to 0 by any sampler is minimized, yet the consequences of the discretization error in training are not studied. The closest results we are aware of in this direction are Proposition E.1 in [Vargas et al., 2024], which concerns the trajectory-level Radon-Nikodym derivative but not the other functionals involved in off-policy losses, and Proposition 9 in [Zhang et al., 2023], which establishes a connection between the limit of detailed balance and score matching, but only considers the asymptotics for a fixed reverse process.
We substantially generalize these known results, both for the global (KL, second-moment) and local (detailed balance) divergences: We show that the discrete-time objectives asymptotically approach continuous-time objects and indicate the order of asymptotics (0th-order in the maximal step size for the trajectory-level divergences (Proposition 3.3), and 0.5th-order and 1st-order for the two results on detailed balance (Proposition 3.4)).
While it is not entirely unexpected that such convergences would hold, it is also not obvious a priori. For example, one could imagine that a sampler trained with a finite number of discretization steps would acquire a bias, relative to the ideal continuous-time sampler, that depends on the discretization and does not vanish as the discretization is refined. We show that this does not happen. The proofs of our results, although they do not require any entirely new proof technique, are not trivial. They require careful application of stochastic calculus results: in particular, we are not aware of the method of proof of convergence of functionals (Proposition B.3) -- via weak convergence (Proposition 3.1) and convergence of Radon-Nikodym derivatives (Lemma B.7) -- being used in the relevant literature. For instance, [Vargas et al., 2024] present an RND discretization in their setting (Proposition E.1), but the convergence, as well as the implications for divergence minimization, is not shown.
These convergences imply that objectives evaluated with different numbers of discretization steps are all approximating the same continuous-time object, which justifies training and inference with different numbers of time steps. If we did not have such convergence, there would be no reason to expect that a sampler trained with one number of discretization steps and sampled with another would have a bias that vanishes as both discretizations are refined.
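Schematically (again with notation simplified relative to the paper), writing $\mathcal{L}_h$ for a trajectory-level divergence objective computed on a discretization with maximal step size $h$, and $\mathcal{L}$ for its continuous-time counterpart between the forward and reverse path measures, the guarantee has the form

```latex
\mathcal{L}_h\bigl(\vec{\mathbb{P}}^{\,h}, \overleftarrow{\mathbb{P}}^{\,h}\bigr)
\;\longrightarrow\;
\mathcal{L}\bigl(\vec{\mathbb{P}}, \overleftarrow{\mathbb{P}}\bigr)
\qquad \text{as } h \to 0,
```

for any (possibly non-uniform) sequence of discretizations, so that objectives computed on different grids during training and inference all approximate one and the same continuous-time quantity.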
The experiments are designed to illustrate the practical implications of the fact that training and inference with different numbers of time steps is theoretically justified. The observations regarding the choice of training discretization, which allow us to greatly reduce the computational cost of training, are an interesting empirical result in their own right that will be of interest to the growing community working on diffusion samplers. They also provide an example of a practicable use of (less expensive) local-time objectives, which so far had not been seen to scale well with long trajectories.
Thanks to your feedback, we added an additional discussion at the end of Section 3.
Dear reviewers,
We would like to thank you again for your valuable feedback on the paper.
The end of the discussion period is approaching. We’ve responded to all of your comments and suggestions individually and in the above comment. Could you please let us know if these have satisfactorily addressed your concerns? We look forward to hearing from you.
Thank you,
The authors.
The paper considers the training of neural stochastic differential equations to sample from Boltzmann distributions. By drawing connections between discrete-time policies and continuous-time diffusion, an asymptotic equivalence between them has been established. As pointed out by reviewer K8pm, the main results (asymptotic convergence from the Euler-Maruyama discretization to continuous-time SDEs) are standard, and most of the preliminary results derived in the paper are either known or standard extensions of known results, so the contribution of the paper is marginal.
Nevertheless, the work still has good potential, as the authors have experimentally shown that non-uniform time steps (in particular the random scheme) can provide a significant performance gain. However, this observation does not have any theoretical support, since the current paper only provides standard results for arbitrary discretization schemes. If theory can be derived to explain this observation, it would resolve the novelty issue raised by the reviewers.
Based on the current feedback, the paper is marginally below the standard of ICLR, and we have to reject it.
Additional Comments from the Reviewer Discussion
Besides the relatively minor issues that were addressed by the authors during the rebuttal, there are two main concerns raised by the reviewers.
- The paper is hard to read.
The reviewer is not convinced by the authors. However, after reading the corresponding discussion during the rebuttal, the AC agrees with the authors that this is mainly due to a lack of expertise in the relevant area.
- The paper lacks novelty and enough contribution as most of the results are standard.
After reading the discussion, the AC agrees with the reviewer on this point. See metareview.
Besides, some minor questions, such as why non-uniform discretization works better, remain unresolved by the authors.
Reject