PaperHub
Rating: 7.0/10 · Poster · 4 reviewers
Scores: 8, 6, 8, 6 (min 6, max 8, std 1.0)
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.0
ICLR 2025

Provable Benefit of Annealed Langevin Monte Carlo for Non-log-concave Sampling

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-17
TL;DR

We obtain the first non-asymptotic complexity analysis of annealed Langevin Monte Carlo for sampling non-log-concave distributions.

Abstract

We consider the outstanding problem of sampling from an unnormalized density that may be non-log-concave and multimodal. To enhance the performance of simple Markov chain Monte Carlo (MCMC) methods, techniques of annealing type have been widely used. However, quantitative theoretical guarantees of these techniques are under-explored. This study takes a first step toward providing a non-asymptotic analysis of annealed MCMC. Specifically, we establish, for the first time, an oracle complexity of $\widetilde{O}\left(\frac{d\beta^2{\cal A}^2}{\varepsilon^6}\right)$ for the simple annealed Langevin Monte Carlo algorithm to achieve $\varepsilon^2$ accuracy in Kullback-Leibler divergence to the target distribution $\pi\propto{\rm e}^{-V}$ on $\mathbb{R}^d$ with $\beta$-smooth potential $V$. Here, ${\cal A}$ represents the action of a curve of probability measures interpolating the target distribution $\pi$ and a readily sampleable distribution.
Keywords
MCMC · Annealed Langevin Monte Carlo · Non-log-concave sampling · Non-asymptotic analysis

Reviews and Discussion

Review
Rating: 8

The paper considers the problem of sampling from an unnormalized density that may be non-log-concave and multimodal using annealed MCMC. The authors present a non-asymptotic analysis of annealed Langevin Monte Carlo, bypassing the log-concavity constraint commonly assumed in the literature. The quantitative bounds derived are widely applicable and either comparable to or better than previous analyses for isoperimetry-free sampling methods.

Strengths

The paper provides, as far as I know, the first thorough and non-asymptotic analysis of the annealed Langevin Monte Carlo algorithm. The results obtained advance the understanding and analysis of annealed MCMC, potentially setting the stage for further research in this area. The intuition behind the algorithm and the proofs are well explained.

Weaknesses

The discussion of the impact of the annealing schedule on the convergence rate is not sufficiently clear in the paper. How does the choice of the annealing schedule influence the derived bounds? Specifically, in the mixture-of-Gaussians example, $\lambda(\theta) \propto (1-\theta)^\gamma$ is used; how does varying $\gamma$ affect the action $\mathcal{A}$? Is there an optimal value of $\gamma$ that minimizes the convergence rate?

Questions

Please see the weakness part.

Comment

Thank you for your appreciation of our work.

Weakness: How does the choice of the annealing schedule influence the derived bounds? Specifically, in the mixture-of-Gaussians example, $\lambda(\theta) \propto (1-\theta)^\gamma$ is used; how does varying $\gamma$ affect the action $\cal A$? Is there an optimal value of $\gamma$ that minimizes the convergence rate?

Response: In this example, we obtain a uniform upper bound on the action $\cal A$ for all $\gamma\in[1,O(1)]$. We choose $\gamma\ge 1$ because, intuitively, when $\theta$ is close to $1$, the distribution $\pi_\theta$ is more difficult to sample from, so we should move more slowly. We agree that for each problem there should be an optimal value of $\gamma$ that achieves the best sampling quality, yet studying the influence of $\gamma$ on the action $\cal A$ is highly non-trivial. This will be the focus of our future work.

We hope that the response above could successfully address your concerns.

Review
Rating: 6

The paper proposes a novel analysis of annealed LMC that uses a specific scheduling system. The latter makes it possible to benefit from an alternative analysis scheme, which uses the action of curves instead of isoperimetric inequalities.

Strengths

The paper provides a concise analysis with a clever mathematical trick: using the action of the curve instead of isoperimetry constants. Assuming gradient Lipschitzness of the potential function makes it easy to sample (via acceptance-rejection) from an initial distribution, which then allows sampling from the target distribution (non-convex, gradient-Lipschitz) using an annealing schedule for LMC.

Weaknesses

  • Assumption 1 is not well discussed. One cannot verify beforehand whether the curve $\pi_{\theta}$ is AC. How should this be done? See the respective question in the section below.
  • Assumption 2 is not the classic smoothness assumption. Instead it is gradient Lipschitz continuity, which is stronger. This should be explicitly mentioned in the paper.
  • The notation is slightly confusing. There are too many indices for $\pi$ and the usage is incoherent. Reading would be simpler if the notation were more carefully developed.
  • Simple experiments on sampling from a mixture of Gaussians using the proposed and existing methods would significantly strengthen the paper.
  • In the abstract, the authors claim that their analysis is non-asymptotic; however, this is not true. Theorem 2 (the main result) and related results are given with $\tilde{O}(\cdot)$.

Mathematical comments

  • The equation after line 1077 in the proof of Example 2 is incorrect. The density of the sum of two independent random variables (vectors) is the convolution of their densities rather than their product. Thus, the variable $X$ is not distributed as $\hat{\pi}_{\lambda}$. This serves as the cornerstone of the remainder of the proof, which means that the proof is incorrect.
  • The third line in the proof of Example 2 is incorrect. After recentering the Gaussians, the authors erase the terms $\exp(-\frac{2\lambda \beta}{\lambda + \beta})\|y_i\|^2$ with the $\propto$ sign. Since we have different $y_i$'s, the resulting mixture weights will be different from the $p_i$'s. Nevertheless, this could be fixed by replacing the $p_i$'s by some $q_i$'s.
  • Paragraph starting at line 364: The authors motivate the use of annealing since the schedule starts at a distribution $\pi_0$ which is easy to sample from. The reason for the easy sampling, according to the authors, is the strong convexity of the negative log-density $-\log\pi_0(x) = \eta(0)V(x) + \lambda_0\|x\|^2/2$. The latter is true only when $\lambda_0$ is large enough.
  • Continuation of the previous point. Line 831: The strong-convexity constant $(\lambda_0 - \eta_0 \beta)$ might be negative, unless $\lambda_0$ is chosen to be large.
  • Beginning of the proof of Theorem 1: $v_t$ is chosen to satisfy the Fokker-Planck equation on line 294. However, later on line 301 it is chosen to minimize the norm $\|v_t\|_{L_2(\tilde{\pi}_t)}$ using Lemma 2. This is possible if $v_t$ also generates $\tilde{\pi}_t$. Is it possible to find such a $v_t$? See the relevant question.

typos

  • line 956: cure -> curve

Questions

  • Is the analysis using the action only available for annealing? Can we gain anything by applying this technique to LMC in the standard setting?
  • What are the estimates of the action parameter $\mathcal{A}$ for a more general class of distributions than mixtures of Gaussians?
  • Is it possible to guarantee that for certain target distributions and schedules Assumption 1 is satisfied? This is important, as it is not easy to manually check for every setting.
  • Is there always a $v_t$ such that $X_t \sim \tilde{\pi}_t$ for all $t$ in (4) and it generates $\tilde{\pi}_t$, i.e., satisfies the Fokker-Planck equation?
Comment

Thank you for your careful reading of our paper and your valuable, constructive feedback.

Weakness 1 and Question 3: One cannot verify beforehand whether the curve is AC. Is it possible to guarantee that for certain target distributions and schedules Assumption 1 is satisfied?

Response: Absolute continuity is a fairly weak regularity condition on the curve of probability measures. Loosely speaking, as long as the curve $(\pi_\theta)_{\theta\in[0,1]}$ has a density $\pi_\theta(x)>0$ that is jointly $C^1$ w.r.t. both $x\in\mathbb{R}^d$ and $\theta\in[0,1]$, it is AC.

For instance, we can prove the proposition above in one dimension. Let $F_\theta(x)=\int_{-\infty}^x\pi_\theta(u)\,{\rm d}u$ be the c.d.f. of $\pi_\theta$, which is strictly increasing, so its inverse $F_\theta^{-1}$ is well-defined; $F_{\cdot}^{-1}(\cdot)$ is also jointly $C^1$. We know from standard textbooks on optimal transport that $W_2^2(\pi_\theta,\pi_{\theta+\delta})=\int_0^1|F_{\theta+\delta}^{-1}(q)-F_\theta^{-1}(q)|^2\,{\rm d}q$. This quantity is $O(\delta^2)$ as $\delta\to0$ due to the $C^1$ property, and hence the curve is AC.
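For intuition, this one-dimensional quantile formula is easy to check numerically. The sketch below is our own illustration (not from the paper), using Python's standard-library `statistics.NormalDist`: it approximates $W_2^2$ between two Gaussians via the quantile integral and compares it to the Gaussian closed form $(m_1-m_2)^2+(s_1-s_2)^2$.

```python
# Our illustration: approximate W_2^2 = int_0^1 |F_1^{-1}(q) - F_2^{-1}(q)|^2 dq
# by a midpoint rule over quantile levels, for two 1-D Gaussians where the
# closed form (m1 - m2)^2 + (s1 - s2)^2 is known.
from statistics import NormalDist

def w2_squared_quantile(d1: NormalDist, d2: NormalDist, n: int = 100_000) -> float:
    """Midpoint-rule approximation of the quantile integral for W_2^2."""
    total = 0.0
    for k in range(n):
        q = (k + 0.5) / n
        total += (d1.inv_cdf(q) - d2.inv_cdf(q)) ** 2
    return total / n

mu, nu = NormalDist(0.0, 1.0), NormalDist(2.0, 3.0)
approx = w2_squared_quantile(mu, nu)
exact = (0.0 - 2.0) ** 2 + (1.0 - 3.0) ** 2  # = 8
print(approx)  # close to 8
```

Note that the quantile formula is specific to one dimension; in higher dimensions $W_2$ has no such closed expression, which is why the argument above is stated for $d=1$.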

Weakness 2: Assumption 2 is not the classic smoothness assumption. Instead it is the gradient Lipschitz continuity, which is stronger. This should be explicitly mentioned in the paper.

Response: We agree that smoothness has several different definitions, such as being $C^1$ or $C^\infty$. The definition we use in this paper, Lipschitz continuity of the gradient (see the "notations and definitions" part in Sec. 1), is also a common one that appears in many standard textbooks on optimization and sampling, such as Sec. 3.2 of Bubeck's Convex Optimization: Algorithms and Complexity and Chap. 4 of Chewi's Log-concave Sampling. We appreciate your comment on the potential ambiguity.

Weakness 3: There are too many indices for $\pi$ and the usage is incoherent.

Response: Sorry for any confusion the paper may have caused. Throughout this paper, $(\pi_\theta)_{\theta\in[0,1]}$ is an AC curve bridging $\pi_0$, a simple distribution, and $\pi_1$, the target distribution, while $(\widetilde{\pi}_t=\pi_{t/T})_{t\in[0,T]}$ is the curve after time-reparameterization. $0=\theta_0<\theta_1<...<\theta_M=1$ are the discrete time points on $[0,1]$, and $T_\ell=T\theta_\ell$ are the discrete time points on $[0,T]$, with $h_\ell=T(\theta_\ell-\theta_{\ell-1})$ being the step size. We hope this clarifies the usage of these notations.

Weakness 4: Simple experiments on sampling from a mixture of Gaussians using the proposed and existing methods would significantly boost the paper.

Response: Please see Sec. 6 for the experimental results, where we verified the findings in Ex. 2.

Comment

Weakness 5: In the abstract of the paper, the authors claim that their analysis is non-asymptotic; however, it is not true. Theorem 2 (main result) and related results are given with $\widetilde{O}(\cdot)$.

Response: "Non-asymptotic analysis" refers to providing explicit bounds on algorithmic performance for fixed problem parameters, such as the dimension $d$, accuracy $\varepsilon$, and smoothness parameter $\beta$, rather than focusing on behavior in asymptotic regimes.

To better explain the difference between asymptotic and non-asymptotic results, consider the following two examples:

  1. The central limit theorem (CLT) implies that for i.i.d. copies $X_1,X_2,...$ of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, the law of $\sqrt{n}(\overline{X}_n-\mu)$ converges to ${\cal N}(0,\sigma^2)$ as $n\to\infty$, where $\overline{X}_n$ is the mean of $X_1,...,X_n$. Here, we only know the limiting behavior of the law of $\sqrt{n}(\overline{X}_n-\mu)$ as $n\to\infty$; for any finite $n$, the CLT does not bound the discrepancy (say, KL divergence, TV distance, etc.) between the law of $\sqrt{n}(\overline{X}_n-\mu)$ and ${\cal N}(0,\sigma^2)$. Hence, this result is an example of asymptotic analysis.
  2. In optimization, we know that gradient descent has linear convergence for strongly convex and smooth functions. More precisely, suppose $f$ is $\alpha$-strongly-convex and $\beta$-smooth with global minimizer $x_\star$; then with step size $\frac{1}{\beta}$, $\|x_n-x_\star\|^2\le\left(1-\frac{\alpha}{\beta}\right)^n\|x_0-x_\star\|^2$. This result not only implies that as the number of iterations $n\to\infty$ the iterate $x_n$ converges to $x_\star$, but also tells us how far we are at every finite step $n$, and how many iterations are required to reach a point that is $\varepsilon$-close to the global minimizer, which is upper bounded by $\frac{\beta}{\alpha}\log\frac{\|x_0-x_\star\|^2}{\varepsilon^2}$, i.e., $O\left(\frac{\beta}{\alpha}\log\frac{\|x_0-x_\star\|}{\varepsilon}\right)$. Hence, this result is an example of non-asymptotic analysis.
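To make point 2 concrete, here is a small self-contained check (our toy illustration, with made-up matrix and starting point) that gradient descent with step size $1/\beta$ on a strongly convex quadratic respects the bound $\|x_n-x_\star\|^2\le(1-\alpha/\beta)^n\|x_0-x_\star\|^2$ at every finite step:

```python
# Our toy check of the non-asymptotic bound for gradient descent on
# f(x) = 0.5 * sum_i a_i x_i^2 (diagonal quadratic, minimizer x* = 0),
# which is alpha-strongly-convex and beta-smooth with alpha = min(a), beta = max(a).
def grad_descent_check(a_diag, x0, n_steps):
    alpha, beta = min(a_diag), max(a_diag)
    x = list(x0)
    dist0 = sum(v * v for v in x0)  # ||x_0 - x*||^2, with x* = 0
    for n in range(1, n_steps + 1):
        x = [xi - (ai * xi) / beta for xi, ai in zip(x, a_diag)]  # step size 1/beta
        dist = sum(v * v for v in x)
        bound = (1 - alpha / beta) ** n * dist0
        assert dist <= bound + 1e-12, (n, dist, bound)  # the bound holds at every n
    return dist, bound

dist, bound = grad_descent_check([1.0, 3.0, 10.0], [5.0, -2.0, 1.0], 50)
print(dist <= bound)  # True
```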

In Thm. 2, we establish the number of oracle calls required to obtain samples that are $\varepsilon$-accurate in KL divergence, which is clearly an example of non-asymptotic analysis. The asymptotic notations $O(\cdot)$ and $\widetilde{O}(\cdot)$ are only used to hide constant and logarithmic factors. We hope this clarifies the difference between asymptotic and non-asymptotic analysis.

Mathematical comment 1: The equation after line 1077 in the proof of Example 2 is incorrect.

Response: Thank you for your careful reading of our proof and for pointing out this potential issue. However, upon careful review, we believe that the proof as presented in the paper is correct. Here, $\widehat{\pi}_\lambda=\sum_{i=1}^N p_i{\cal N}\left(\frac{\beta}{\lambda+\beta}y_i,\frac{1}{\lambda+\beta}I\right)$. To sample from such a mixture of Gaussians with a common covariance, we can first sample the mean $\frac{\beta}{\lambda+\beta}y_i$ w.p. $p_i$, then convolve with ${\cal N}\left(0,\frac{1}{\lambda+\beta}I\right)$. Our construction of $X$ picks the mean and convolves it with a Gaussian, which is correct.
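This mean-then-noise construction is easy to verify empirically. Below is our own minimal sketch (with toy weights and means, unrelated to the paper's $y_i$'s), checking that sampling the mean with probability $p_i$ and then adding independent Gaussian noise reproduces the mixture's mean:

```python
# Our sketch: draw from sum_i p_i N(m_i, s^2) (shared variance) by first picking
# a mean m_i w.p. p_i, then adding independent N(0, s^2) noise, and check that
# the empirical mean matches sum_i p_i m_i.
import random

def sample_mixture(means, probs, s, rng):
    m = rng.choices(means, weights=probs, k=1)[0]  # pick the component mean
    return m + rng.gauss(0.0, s)                   # convolve with N(0, s^2)

rng = random.Random(0)
means, probs, s = [-3.0, 0.0, 4.0], [0.2, 0.5, 0.3], 0.7
xs = [sample_mixture(means, probs, s, rng) for _ in range(200_000)]

emp_mean = sum(xs) / len(xs)
true_mean = sum(p * m for p, m in zip(probs, means))  # = 0.6
print(emp_mean, true_mean)  # empirical mean should be close to 0.6
```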

Mathematical comment 2: The third line in the proof of Example 2 is incorrect.

Response: Please note that we have assumed all $y_i$'s have the same norm $r$ in this example, hence all the terms $\exp\left(-\frac{2\lambda \beta}{\lambda + \beta}\right)\|y_i\|^2$ are identical and can be cancelled.

Mathematical comments 3 and 4: Paragraph starting at line 364. [...] The latter is true when $\lambda$ is large enough. Line 831: The strong-convexity constant $(\lambda_0 - \eta_0 \beta)$ might be negative, unless $\lambda_0$ is chosen to be large.

Response: Your observation is correct. In Lem. 4, we have proved that with $\lambda_0 = \eta_0 d\beta$, the complexity of sampling from $\pi_0$ is $\widetilde{O}(1)$.

Comment

Mathematical comment 5 & Question 4: Beginning of the proof of Theorem 1. [...] Is there always a $v_t$ such that $X_t \sim \tilde{\pi}_t$ for all $t$ in (4) and it generates $\tilde{\pi}_t$, i.e., satisfies the Fokker-Planck equation?

Response: The logic here is as follows. We first choose any vector field $(v_t)$ s.t. it generates $(\widetilde\pi_t)$. Such a vector field exists given the absolute continuity of $(\widetilde\pi_t)$; see, e.g., Thm. 8.3.1 of Ambrosio et al., 2008 or Thm. 5.14 of Santambrogio, 2015. By the Fokker-Planck equation, this guarantees that $X_t\sim\widetilde{\pi}_t$ for all $t$ in (4). Girsanov's theorem implies that ${\rm KL}(\mathbb{P}\|\mathbb{Q})=\frac{1}{4}\int_0^T\|v_t\|_{L^2(\widetilde\pi_t)}^2\,{\rm d}t$. Using Lem. 2, among all vector fields $(v_t)$ that generate $(\widetilde{\pi}_t)$, we may choose the one that minimizes $\|v_t\|_{L^2(\widetilde\pi_t)}^2$, which implies that ${\rm KL}(\mathbb{P}\|\mathbb{Q})=\frac{\cal A}{4T}$.

Question 1: Is the analysis using the action only available for annealing? Can we gain anything by applying this technique to LMC in the standard setting?

Response: We think it is not applicable to LMC: there we are not running LD with a changing target distribution, and thus there is no action $\cal A$. However, this action-based approach may be applicable to the study of diffusion models and classifier-free guidance, which we will study in the future.

Question 2: What are the estimates of the action parameter $\cal A$ for a more general class of distributions than mixtures of Gaussians?

Response: The estimate of $\cal A$ for more general target distributions is hard to obtain, as the Wasserstein distance does not have a closed form in general. One possible way is to leverage Lem. 3, provided we have upper bounds on the LSI/PI constant. Yet, as Ex. 1 suggests, the resulting upper bound may be loose. This problem will be the focus of our future study.

We sincerely appreciate the time and effort you have dedicated to providing thoughtful feedback. We hope that our response has successfully addressed your concerns.

Comment

Weakness 1: The condition is indeed mild. However, for the sake of completeness, the theory should include a claim justifying why this assumption holds a priori. This claim needs to be explicitly stated and supported with a proof.

Weakness 3: Your explanation clarifies the notation well. Please include this discussion in the revised version for better readability and completeness.

Weakness 4 (Extended): The experiments do not compare the proposed method with existing approaches. Sampling from mixtures of distributions, such as those that can be addressed with LMC, is relatively straightforward (see Section 2.3 of [1]). Therefore, the paper should evaluate its method in more challenging scenarios than Gaussian mixtures.

Additionally, the ALMC results depend on the action parameter, which may become very large in the case of mixture distributions. Given that Gaussian mixtures are "easy" to sample from using Langevin methods, comparing the proposed approach with standard LMC under a global LSI condition is not particularly informative. While the proposed method might remain relevant in more general cases, the behavior of the action parameter under these conditions still needs thorough investigation.

Weakness 5: I understand your point, and I agree with it. What I intended to convey is that deriving convergence rates up to a constant would enhance the precision of the results.

Additional Comments: The rest of my concerns were adequately addressed, and I appreciate the effort made to clarify these points. As a result, I have increased my score. However, I still believe the paper requires further improvements, particularly in its experimental validation and theoretical completeness.


Reference:
[1] Dalalyan, Arnak S., and Avetik Karagulyan. "User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient." Stochastic Processes and their Applications 129.12 (2019): 5278-5311.


Comment

We would like to sincerely thank you for raising the rating and we appreciate your kind suggestions for improvement.

Several slight modifications have been made in the revised manuscript: we have included a new Lem. 4 in App. A, which is a formal statement of a sufficient condition for absolute continuity in the rebuttal, and have included the discussion on smoothness as a footnote on page 3.

For weakness 5, the complexity bound can actually be written as $O\left(\frac{d\beta^2{\cal A}^2}{\varepsilon^6}\right)+\widetilde{O}(1)$, where $O\left(\frac{d\beta^2{\cal A}^2}{\varepsilon^6}\right)$ is the dominant complexity of running ALMC, and the $\widetilde{O}(1)$ part is for sampling from $\pi_0$. The latter is actually $O(1)$ when $\eta_0=0$, and $O\left(1\vee\log\frac{\eta_0\beta R^2}{d^2}\right)$ otherwise (see Lem. 5). As the $\log$ part is in general small, this complexity can be absorbed into $O\left(\frac{d\beta^2{\cal A}^2}{\varepsilon^6}\right)$.

Further experimental validation and theoretical completeness on more general non-log-concave distributions will be the focus of our future study. Thank you again for your valuable comments.

Review
Rating: 8

The paper proves new bounds on the efficacy of annealed Langevin MC by creating an SDE which approximates the target and then analyzing the discretization error introduced by simulating it on a finite machine.

Strengths

Computational bounds are generic: a smoothness constraint on the target distribution, rather than isoperimetry assumptions.

Explanations are clear and well-illustrated, suiting a broad generalist audience by mixing intuition-building and commentary.

The direct impact of this paper is unclear, but the methods it introduces are interesting contributions.

Weaknesses

The paper shares the weakness of most such oracle inequalities, which is that they reassure us of convergence of an algorithm under assumptions that are hard to verify on a particular problem

Note that I am not an expert in all the components of the authors' proof, and have taken some steps on faith.

Questions

What does $\beta$-smoothness imply about the target distribution? Can we know this in general, especially when the density arises from a posterior? It seems this quantity would be hard to calculate even for tractable likelihoods, let alone general posterior densities.

Comment

Thank you for your appreciation of our work and instructive feedback.

Weakness: The paper shares the weakness of most such oracle inequalities, which is that they reassure us of convergence of an algorithm under assumptions that are hard to verify on a particular problem.

Response: We agree that verifying these assumptions can be challenging in specific applications. However, our goal in establishing these conditions is to provide a broadly applicable theoretical framework that characterizes the behavior of annealed sampling in a general setting. These assumptions are intended to capture typical scenarios that arise in sampling from complex, multimodal distributions, allowing our results to extend to a wide range of problems.

Question:

  1. What does $\beta$-smoothness imply about the target distribution?

    Response: Consider a target distribution $\pi\propto{\rm e}^{-V}$, where the potential $V$ is $\beta$-smooth. By definition, this means $-\beta I\preceq\nabla^2V\preceq\beta I$, which is equivalent to

    $$\begin{cases}V(y)\le V(x)+\left\langle\nabla V(x),y-x\right\rangle+\frac\beta2\|y-x\|^2\\ V(y)\ge V(x)+\left\langle\nabla V(x),y-x\right\rangle-\frac\beta2\|y-x\|^2\end{cases}$$

    Intuitively, this means the potential can be upper and lower bounded by quadratic functions at every point, meaning that it does not have regions of extremely high or low curvature. This assumption on the regularity of the target distribution aids in controlling the convergence of ALMC and in establishing the oracle complexity bounds.

  2. Can we know this in general, especially where the density arises from a posterior? It seems like this quantity would be hard to calculate for even tractable likelihoods, let alone general posterior densities.

    Response: Our paper focuses on the theoretical analysis of sampling, where we assume the potential is smooth to simplify the problem. In practice, the smoothness of the target distribution's potential requires case-by-case verification, and the constant is computable for some target distributions.

    For instance, consider the Bayesian logistic regression problem: the parameter has prior distribution $\beta\sim{\cal N}(0,\eta^2I)$, and the likelihood is $Y|X,\beta\sim{\rm Bernoulli}(\sigma(\beta^{\rm T}X))$, where $\sigma(t)=\frac{1}{1+{\rm e}^{-t}}$ is the sigmoid function. Then the posterior distribution given paired data points ${\cal D}=\{(x_i,y_i)\}_{i=1,2,...,N}$ is $p(\beta|{\cal D})\propto\exp\left(-\frac{\|\beta\|^2}{2\eta^2}-\sum_i\beta^{\rm T}x_i(1-y_i)-\sum_i\log(1+{\rm e}^{-\beta^{\rm T}x_i})\right)=:\exp(-V(\beta))$. It is easy to verify that $\nabla^2V(\beta)=\frac{1}{\eta^2}I-\sum_ix_ix_i^{\rm T}\frac{{\rm e}^{\beta^{\rm T}x_i}}{(1+{\rm e}^{\beta^{\rm T}x_i})^2}$. So for any unit vector $u$, $u^{\rm T}\nabla^2V(\beta)u=\frac{1}{\eta^2}-\sum_i(u^{\rm T}x_i)^2\frac{{\rm e}^{\beta^{\rm T}x_i}}{(1+{\rm e}^{\beta^{\rm T}x_i})^2}$, which implies $u^{\rm T}\nabla^2V(\beta)u\le\frac{1}{\eta^2}$ and $u^{\rm T}\nabla^2V(\beta)u\ge\frac{1}{\eta^2}-\sum_i(u^{\rm T}x_i)^2\ge\frac{1}{\eta^2}-\sum_i\|x_i\|^2$. Hence, $V$ is smooth with constant $\max\left(\frac{1}{\eta^2},\sum_i\|x_i\|^2-\frac{1}{\eta^2}\right)$.

    We agree that verifying smoothness may be hard in some general cases, let alone computing the constant $\beta$. Nevertheless, this assumption is common in non-asymptotic analyses, as it allows us to develop a clearer understanding of the algorithm's behavior under idealized conditions, even if such conditions may be approximations in practical scenarios.
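As a generic numerical companion to this response (our own illustration with synthetic data; it uses the crude bound $\frac{1}{\eta^2}+\frac14\sum_i\|x_i\|^2$ obtained from $\sigma'(t)\le\frac14$, not the constant derived above), one can probe the gradient-Lipschitz constant of the logistic-regression potential by sampling random point pairs:

```python
# Our illustration: estimate the gradient-Lipschitz (smoothness) constant of the
# Bayesian-logistic-regression potential V by sampling random point pairs, and
# compare against the crude analytic bound 1/eta^2 + (1/4) * sum_i ||x_i||^2.
import math, random

def grad_V(beta, xs, ys, eta):
    # V(b) = ||b||^2/(2 eta^2) + sum_i [ b.x_i (1 - y_i) + log(1 + exp(-b.x_i)) ]
    g = [bj / eta**2 for bj in beta]
    for x, y in zip(xs, ys):
        t = sum(bj * xj for bj, xj in zip(beta, x))
        sig = 1.0 / (1.0 + math.exp(-t))
        coef = (1 - y) - (1 - sig)  # d/dt of the data terms for this observation
        g = [gj + coef * xj for gj, xj in zip(g, x)]
    return g

rng = random.Random(1)
d, n, eta = 3, 5, 0.8
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
ys = [rng.randint(0, 1) for _ in range(n)]
bound = 1 / eta**2 + sum(sum(xj * xj for xj in x) for x in xs) / 4

ratios = []
for _ in range(500):
    b1 = [rng.gauss(0, 2) for _ in range(d)]
    b2 = [rng.gauss(0, 2) for _ in range(d)]
    g1, g2 = grad_V(b1, xs, ys, eta), grad_V(b2, xs, ys, eta)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))
    den = math.sqrt(sum((a - b) ** 2 for a, b in zip(b1, b2)))
    ratios.append(num / den)

print(max(ratios) <= bound)  # empirical Lipschitz ratios respect the bound
```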

We hope that our response has successfully addressed your concerns.

Review
Rating: 6

The paper studies the convergence of annealed Langevin algorithms for general, possibly non-log concave, target distributions. The authors propose using an argument based on Girsanov's theorem to bound the distance between a reference process and a Langevin SDE that is based on an annealing schedule of target distributions, bridging the target to a base distribution. The authors also consider a practical implementation of the continuous time SDE, based on the exponential integrator scheme.

Strengths

The authors consider an interesting problem, which is actively researched. Most of the research has focused on obtaining convergence bounds for standard Langevin algorithms imposing strong assumptions on the target distribution, such as log-concavity or log-Sobolev inequalities. The approach in this paper seems original, since it differs from strategies that are typically used in the literature as explained by the authors. Moreover, it gives convergence bounds in KL for an annealed Langevin algorithm, for which few results are available, and under "weak" conditions. Overall, the paper is clear enough to follow what the authors are doing and has a cohesive story.

Weaknesses

  • The authors consider a specific annealing schedule and it is not clear if their results give any kind of clarity on how the schedule affects the overall complexity of the algorithm.

  • On several occasions I would have appreciated clearer explanations. For instance, why consider the exponential integrator scheme instead of an Euler scheme? Why consider that specific annealing schedule? It is also not entirely clear on a first read why one encounters the topics in Section 2.4. It would be beneficial to convince the reader that these mathematical concepts are needed, also giving an idea of how they are useful in what follows.

  • The sketch of the proof of Theorem 2 is not very clear to me and I find it should be heavily improved. It seems the authors compute the KL between two processes, neither of which is the discretisation scheme. Then the notation in the proof in the appendix seems inconsistent, denoting by $\nu^{\rm ALMC}$ the measure of another process. As I explain in the Questions section below, I struggle to see how the authors use Lemma 1 in the proof of this result. This should be clarified if the paper is to be considered for acceptance.

  • The contribution is in general interesting, but I would have liked to see analyses such as some (simple) numerical simulations showing that the convergence of the algorithm is, for instance, not exponential in the dimension.

  • Table 1 gives an overview of results available in the literature, but these are for various algorithms which are denoted by corresponding acronyms. This is not a very clear comparison in my opinion, and it would be nice to have a short explanation at the differences between these algorithms.

  • The intuitive explanations in lines 318-323 are rather obvious and are not really clarified by the results in the paper. This should be made clear to the reader to avoid confusion.

Questions

  • Theorem 2 uses Lemma 1, that is a consequence of Girsanov's theorem. As far as I understand it uses it to bound the KL between the SDE in Equation (6) with the "true" SDE, that is Equation (4). However, the SDE (6) clearly does not fit into Lemma 1 of the paper, since a term in the drift depends on a past state. The authors should explain their reasoning and not just refer to Lemma 1, which is not general enough.

  • In Lemma 2, in what sense do the authors say the equality is "reachable"?

  • In lines 373-374 the authors state that the Euler scheme is "non-optimal", without supporting reasoning. What are their reasons for writing that?

  • In Example 1, the authors say that $\rho_1$ is "readily sampleable". Is it then the case that $\rho_0$ is a simple distribution that can be sampled? The notation is not yet clear here. Also, in point 2 of the example, are the LSI and PI constants exponentially dependent on $d$ for the distribution $\rho_t$ for any $t$? In general, this example could be made clearer.

Comment

Thank you for your careful reading of our paper and your valuable, constructive feedback.

Weakness 1: The authors consider a specific annealing schedule and it is not clear if their results give any kind of clarity on how the schedule affects the overall complexity of the algorithm.

Response: The continuous-time analysis we present in the paper is general and applies to any AC interpolation. In practice, the effect of the annealing schedule on the complexity is a crucial question.

In Lem. 3(1), we showed that for fixed $\rho_0$ and $\rho_1$, the curve $\rho$ that has the minimal action ${\cal A}=\int_0^1|\dot\rho_t|^2\,{\rm d}t$ is the constant-speed Wasserstein geodesic. However, this does not have a closed form except in some trivial cases, such as Gaussian distributions.

To realize the constant-speed Wasserstein geodesic, one may time-reparameterize the Wasserstein gradient flow. For example, consider the OU process ${\rm d}Y_t=-Y_t\,{\rm d}t+\sqrt2\,{\rm d}B_t$, $Y_0\sim\pi$ (the target distribution) and $Y_t\sim p_t$. The OU process is the Wasserstein gradient flow of the functional ${\rm KL}(\cdot\|{\cal N}(0,I))$, and after time-reparameterization of the curve $(p_t)_{t\in[0,\infty)}$, one may obtain a curve close to the constant-speed Wasserstein geodesic between $\pi$ and ${\cal N}(0,I)$. This curve is widely used in diffusion models, but we do not have a closed form for the scores $\nabla\log p_t$ in general. As our goal is to obtain the complexity of learning-free sampling algorithms, we do not study this approach in the paper. Instead, we focus on the curve $\pi_\theta\propto\exp\left(-\eta(\theta)V-\frac{\lambda(\theta)}{2}\|{\cdot}\|^2\right),~\theta\in[0,1]$. The effect of general annealing schedules on the sampling complexity will be the focus of our future study.
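For readers who want to experiment, here is our own toy sketch of annealed Langevin dynamics along a curve of this form, with the made-up schedule $\eta(\theta)=\theta$, $\lambda(\theta)=1-\theta$ and a double-well potential; note that the paper's Alg. 1 uses an exponential integrator, whereas this simplified sketch uses plain Euler-Maruyama:

```python
# Our simplified sketch (not the paper's Alg. 1): Euler-Maruyama simulation of
# annealed Langevin dynamics along pi_theta ∝ exp(-eta(theta) V - lambda(theta)||x||^2/2)
# with eta(theta) = theta, lambda(theta) = 1 - theta, for a toy double-well target.
import math, random

def grad_V(x):
    # Non-log-concave toy target: V(x) = (x^2 - 4)^2 / 8, with modes near x = ±2
    return x * (x * x - 4) / 2

def almc(n_steps, T, rng):
    h = T / n_steps
    x = rng.gauss(0.0, 1.0)  # pi_0 = N(0, 1): eta(0) = 0, lambda(0) = 1
    for k in range(n_steps):
        theta = k / n_steps
        drift = -theta * grad_V(x) - (1.0 - theta) * x  # score of pi_theta
        x += h * drift + math.sqrt(2 * h) * rng.gauss(0.0, 1.0)
    return x

samples = [almc(5_000, 50.0, random.Random(seed)) for seed in range(200)]
print(min(samples) < -1, max(samples) > 1)  # annealing lets chains reach both modes
```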

Weakness 2:

  1. Why consider the exponential integrator scheme instead of an Euler scheme?

    Response: Because the exponential integrator scheme reduces discretization error compared with the Euler scheme. Recall that the SDE of ALD is ${\rm d}X_t=\left(-\eta\left(\frac{t}{T}\right)\nabla V(X_t)-\lambda\left(\frac{t}{T}\right)X_t\right){\rm d}t+\sqrt{2}\,{\rm d}B_t,~t\in[0,T]$. Assume the discretization time points are $0=T_0<T_1<...<T_M=T$. On $[T_{\ell-1},T_\ell]$, the Euler discretization scheme treats the linear part of the drift, $-\lambda\left(\frac{t}{T}\right)X_t$, as constant, but in fact its integral over $[T_{\ell-1},T_\ell]$ can be computed in closed form; by doing so, the exponential integrator scheme reduces the discretization error. This is also a standard approach for accelerating the sampling of diffusion models.

  2. Why consider that specific annealing schedule?

    Response: This specific family of interpolation curves is a popular choice in practice (see the last paragraph of Section 5.1 for discussion). Moreover, another reason for choosing this family is that the score functions $\nabla\log\pi_\theta$ have closed form; therefore, unlike many other curves such as the Gaussian convolution curve $\pi_\theta=\pi*{\cal N}(0,\sigma^2(\theta)I)$, we do not need further training to approximate the score.

  3. It is also not completely clear, at a first read, why the topics in Section 2.4 are introduced.

    Response: Section 2.4 provides essential technical tools from optimal transport for establishing our results. For instance, Lem. 2 is used in the proofs of Thms. 1 and 2 to minimize the integral of squared $L^2$ norms, and the action turns out to be a crucial property of the annealing curve (see, e.g., Ex. 2). Lem. 3 concerns the relationship between the action and isoperimetric inequalities, which sets this approach apart from the traditional analysis based on LSI or PI. We have added several linking sentences to inform readers of the purpose of introducing these concepts and properties.
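To make the exponential-integrator discretization discussed in point 1 concrete, here is a hedged sketch of a single step with the gradient frozen at the left endpoint. The function names, the quadratic example potential, and the simplification that $\lambda$ is also frozen within each step are our own illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def grad_V(x):
    # Illustrative potential V(x) = ||x||^2 / 2, so grad V(x) = x.
    return x

def euler_step(x, g, lam, h, rng):
    """Euler-Maruyama: treat the whole drift -g - lam*x as constant over [t, t+h]."""
    return x - h * (g + lam * x) + np.sqrt(2 * h) * rng.standard_normal(x.shape)

def exp_integrator_step(x, g, lam, h, rng):
    """Exponential integrator: freeze grad V at the left endpoint (value g), but
    integrate the linear part -lam*x and the Brownian noise in closed form."""
    decay = np.exp(-lam * h)
    mean = decay * x - (1.0 - decay) / lam * g
    var = (1.0 - np.exp(-2.0 * lam * h)) / lam  # exact variance of the OU increment
    return mean + np.sqrt(var) * rng.standard_normal(x.shape)

rng = np.random.default_rng(1)
x = np.ones(4)
g = grad_V(x)        # gradient frozen over the step, as in ALMC
lam, h = 1.0, 0.1    # illustrative annealing weight and step size
x_euler = euler_step(x, g, lam, h, rng)
x_ei = exp_integrator_step(x, g, lam, h, rng)
```

Both updates agree to first order in $h$; the exponential integrator removes the discretization error of the linear part entirely, which is the sense in which it improves on Euler.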

Comment

Weakness 3: The sketch of the proof of Theorem 2.

  1. The authors compute the KL between two processes, neither of which is the discretisation scheme.

    Response: Please refer to Figure 1 for the illustration. The two processes are: the reference SDE (Eq. 4) with path measure $\mathbb{P}$, which is a continuous-time process, and the ALMC SDE (Eq. 6) with path measure $\mathbb{Q}$, which is a discretized process, although we write it in the form of an SDE. The explicit update form for practical use is detailed in Alg. 1.

  2. Then the notation in the proof in the appendix seems inconsistent, denoting by νALMC\nu^{\rm ALMC} the measure of another process.

    Response: $\nu^{\rm ALMC}$ is defined as the distribution of $X _T$ in the ALMC SDE (Eq. 6), i.e., the marginal distribution of the path measure $\mathbb{Q}$ at the final time $T$.

  3. I struggle to see how the authors use Lemma 1 in the proof of this result.

    Response: Please see the response of Question 1.

Weakness 4: I would have liked to find analyses such as some (simple) numerical simulations showing that the convergence of the algorithm is for instance not exponential in the dimension.

Response: Please see Sec. 6 for the experimental results, where we studied the complexity w.r.t. $r$, the common norm of the means. As discussed in the response to Question 4 below, in this case we show an improved bound on $r$ instead of $d$, which we have corrected in the paper.

Weakness 5: Table 1 [...] It would be nice to have a short explanation at the differences between these algorithms.

Response: Thanks for your advice. We have modified the paper and added descriptions of the algorithms in Sec. 1, before the table.

Weakness 6: The intuitive explanations in lines 318-323 are rather obvious and are not really clarified by the results in the paper. This should be clear to the reader to avoid confusion.

Response: In Thm. 1, we have shown that for any given AC curve of probability measures $(\pi _\theta) _{\theta\in[0,1]}$ and a sufficiently long time $T$, if we reparametrize the curve and run ALD from time $0$ to $T$, then the KL divergence from the target distribution $\pi=\pi _1$ to the obtained distribution $\nu^{\rm ALD}$ at time $T$ is $\le\frac{\cal A}{4T}$. This shows that with a longer $T$, we move along the curve more slowly, and the discrepancy between the target and the obtained distribution diminishes. This underlies the intuitive explanations in that paragraph.

Question 1: The SDE (6) clearly does not fit into Lemma 1 of the paper, since a term in the drift depends on a past state.

Response: Thank you for the advice. Girsanov's theorem also works when the process $Y$ is adapted (not necessarily Markov). More specifically, consider a probability space $(\Omega,{\cal F},({\cal F} _t) _{t\ge0},\mathbb{P})$ on which $W=(W _t,{\cal F} _t) _{t\ge0}$ is a standard $d$-dimensional Brownian motion. We say that $Y=(Y _t,{\cal F} _t) _{t\ge0}$ is adapted to the filtration $({\cal F} _t) _{t\ge0}$ if $Y _t$ is an ${\cal F} _t$-measurable random vector for all $t\ge0$. For instance, in our paper, the process $Y$ may take the differential form ${\rm d}Y _t=\color{red}{b _t(Y _{t _-})}{\rm d}t+\sqrt{2}{\rm d}B _t$, where $t\mapsto t _-$ is a piecewise-constant function with $t _-\le t$, corresponding to the most recent discretization time point. In this case, Girsanov's theorem takes the form $$\operatorname{KL}(\mathbb{P}^X\|\mathbb{P}^Y)=\operatorname{KL}(\mu\|\nu)+\frac{1}{4}\mathbb{E} _{X\sim\mathbb{P}^X}\int _0^T\|a _t(X _t)-\color{red}{b _t(X _{t _-})}\|^2{\rm d}t.$$ For more details, please see Thm. 3.2.6 of the book Log-concave Sampling. We have modified the paper accordingly.
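A minimal sketch of the piecewise-constant map $t\mapsto t _-$ used above (the helper name and the example grid are hypothetical):

```python
import numpy as np

def t_minus(t, grid):
    """Map time t to the most recent discretization point T_l <= t.
    `grid` is the sorted array of discretization times 0 = T_0 < ... < T_M."""
    idx = np.searchsorted(grid, t, side="right") - 1
    return grid[idx]

grid = np.array([0.0, 0.5, 1.2, 2.0])
print(t_minus(0.7, grid))  # -> 0.5
```

Freezing the drift at $X _{t _-}$ in this way is exactly what turns the discrete ALMC iteration into a continuous-time SDE to which Girsanov's theorem applies.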

Question 2: In Lemma 2, in what sense do the authors say the equality is "reachable"?

Response: This means that for any AC curve of probability measures $(\rho _t) _{t\in[a,b]}$, there exists a vector field $(v _t^*) _{t\in[a,b]}$ that generates $(\rho _t) _{t\in[a,b]}$ such that $|\dot\rho| _t=\|v _t^*\| _{L^2(\rho _t)}$ for Lebesgue-a.e. $t\in[a,b]$. We have modified the statement of this lemma accordingly and hope that it is now clearer to the reader.

Question 3: In lines 373-374 the authors state that the Euler scheme is "non-optimal", without a supporting reasoning. What are their reasons for writing that?

Response: Here, by "non-optimal", we mean that there exists a better discretization scheme of SDE (3), in the sense that the discretization error can be reduced. The reason for the non-optimality is explained in the following paragraph; the key point is that the integral of the linear term can be computed in closed form.

Comment

Question 4:

  1. In Example 1, the authors say that $\rho _1$ is "readily sampleable". Is it then the case that $\rho _0$ is a simple distribution that can be sampled? Here the notation is not clear yet.

    Response: In this example, $\rho _0$ can be any general target distribution (which may not be easy to sample from), and $\rho _1$ is the convolution of $\rho _0$ with Gaussian noise of high variance, which is easier to sample from (i.e., readily sampleable). When the noise level $2S$ is large enough, samples from ${\cal N}(0,2SI)$ can serve as approximate samples from $\rho _1$. Note that this differs from the later setting, where $\pi _0$ is a distribution that is easy to sample from and $\pi _1$ is the target.

  2. Are the LSI and PI constants exponentially dependent on $d$ for the distribution $\rho _t$ for any $t$?

    Response: Sorry about the mistake we made here: after carefully checking the reference, we found that the dimensional dependence may not be exponential; however, the dependence on the maximum distance between the modes should be exponential in general. We have modified Ex. 1 and its proof accordingly.

    To illustrate this, consider a simpler case with only two Gaussian components: $\rho=\frac12{\cal N}(0,\sigma^2I)+\frac12{\cal N}(y,\sigma^2I)$. Schlichting (2019) showed that $C _{\rm PI}(\rho)\le C _{\rm LSI}(\rho)\le\frac{\sigma^2}{2}\left({\rm e}^{\|y\|^2/\sigma^2}+3\right)$, while Dong & Tong (2022) showed that when $\|y\|\gtrsim\sigma\sqrt d$, $C _{\rm LSI}(\rho)\ge C _{\rm PI}(\rho)\gtrsim\frac{\sigma^4}{\|y\|^2}\exp\left(\Omega\left(\frac{\|y\|^2}{\sigma^2}\right)\right)$. Hence, we conclude that in this setting the LSI/PI constants of $\rho _t$ depend exponentially on the maximum distance between the modes when $t$ is close to $0$, but the dimensional dependence might not be exponential.
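The exponential growth of Schlichting's upper bound in $\|y\|/\sigma$ can be checked numerically; this sketch simply evaluates the stated formula (the function name is ours):

```python
import numpy as np

def lsi_upper_bound(y_norm, sigma):
    """Upper bound sigma^2/2 * (exp(||y||^2/sigma^2) + 3) on C_LSI(rho) for the
    two-component mixture rho = 1/2 N(0, sigma^2 I) + 1/2 N(y, sigma^2 I)."""
    return sigma**2 / 2 * (np.exp(y_norm**2 / sigma**2) + 3)

# The bound blows up exponentially in ||y||/sigma, independently of the dimension d.
for y_norm in [1.0, 2.0, 4.0]:
    print(y_norm, lsi_upper_bound(y_norm, 1.0))
```

This matches the discussion above: the difficulty is governed by the separation of the modes relative to their width, not by the ambient dimension.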

We sincerely appreciate the time and effort you have dedicated to providing thoughtful feedback. We hope our responses align with your expectations.

Comment

I thank the authors for their clarifications and modifications. I have changed my grade to 6.

Comment

Dear reviewer, thank you for raising the score! We would like to know if there are any additional concerns you still have regarding this work; we would be eager to address them thoroughly.

Comment

We would like to extend our sincere gratitude to all the reviewers for their insightful comments and valuable feedback. We are pleased that the reviewers recognized the importance of the annealed sampling problem addressed in the paper (reviewer 4z2B), appreciated the novelty of our technical contributions (reviewers 4z2B, KRFS, WqFj, WbhJ), and commended the clarity of our presentation (reviewers 4z2B, KRFS, WbhJ). Our analysis introduces a novel framework for understanding annealing algorithms through the action of the interpolation trajectory, setting our work apart from existing bounds by avoiding dependence on isoperimetric inequalities. We believe this framework will significantly advance the understanding of annealed sampling algorithms and pave the way for future research directions, including the design of annealing schedules.

While all reviewers agreed on the novelty of our approach, two primary concerns were raised:

  1. The need to improve the writing in several parts of the paper to enhance both accuracy and readability for a general audience.
  2. The lack of simple numerical results to validate the theoretical findings presented in the paper.

In our revised manuscript, we have made substantial improvements to address these concerns, providing clarifications for all raised questions. Notable changes in the main paper are marked in $\color{red}{\rm red}$ and enumerated below:

  • We have added descriptions of the algorithms compared in Tab. 1.
  • The statement of Girsanov's theorem (Lem. 1) has been modified to include cases where the drift terms in the SDE depend on the past state.
  • We have clarified the statement "the equality is reachable" in Lem. 2. and expanded the discussion on the motivation for introducing OT concepts in Sec. 2.4.
  • An error in Ex. 1 related to the dimensional dependence of LSI/PI constants has been corrected, along with adjustments to the corresponding proof in App. A.
  • We added a footnote in the proof of Thm. 1 for rigor.
  • In Sec. 6, we added a numerical experiment to verify the findings in Ex. 2, showing that the number of iterations required to achieve a certain accuracy in KL divergence depends polynomially on $r$.
  • Finally, we slightly modified the layout of the paper to fit the additional content.

We have also addressed all concerns in the individual responses below. We hope that the changes we have made reflect our commitment to improving the quality of our work and that the revised manuscript meets your expectations.

AC Meta-Review

This paper provides a non-asymptotic analysis of annealed MCMC. The authors establish the oracle complexity of a simple variant of annealed Langevin Monte Carlo and provide convergence guarantees for this algorithm to the target in KL divergence. The target is assumed to have a $\beta$-smooth potential, and the rate is in terms of the action of a curve of measures interpolating the target, which is an intriguing parameter.

This paper was reviewed by four reviewers with the following Scores/Confidence: 8/4, 8/3, 6/4, 6/4. I think the paper studies an interesting topic and the results are relevant to the ICLR community. The following concerns were brought up by the reviewers:

  • It is not clear if the rate is tight (most likely it is not). As is, the result seems to be another upper bound on sampling from a class of targets.

  • Assumptions are not discussed in detail and compared with standard ones in the literature. This should be improved.

  • Even though this is a theoretical paper, a detailed experimentation section would make the paper much stronger.

  • Minor typos were pointed out by the reviewers. These should be carefully addressed. One reviewer requested a detailed explanation of the integrator.

Authors should carefully go over reviewers' suggestions and address any remaining concerns in their final revision. Based on the reviewers' suggestion, as well as my own assessment of the paper, I recommend including this paper to the ICLR 2025 program.

Additional Comments from Reviewer Discussion

Reviewer questions are thoroughly answered by the authors. The revision provides colour-coded updates.

Final Decision

Accept (Poster)