Constrained Sampling with Primal-Dual Langevin Monte Carlo
We consider the problem of sampling from a target probability distribution whose density is known up to a normalization constant while satisfying a set of statistical constraints specified by general nonlinear functions.
Abstract
Reviews and Discussion
This paper aims to solve the constrained sampling problem via a primal-dual method. The authors propose a new sampling method, PD-LMC, and provide a detailed convergence analysis. Several numerical experiments were conducted to verify the method.
Strengths
The structure of the paper is easy to follow. The authors discuss detailed background and related work. The authors also establish solid convergence guarantees for their algorithms under different settings.
Weaknesses
- The primal-dual method for the constrained sampling problem is not novel. Similar algorithms were also studied in [1]. I notice that PD Langevin in [1] requires the estimation of an expectation with respect to the sampling distribution, but this can easily be addressed by running many particles in parallel. Therefore, I take the theoretical analysis as the most important contribution of this paper. The weaknesses are listed as follows.
- In the theoretical analysis, the authors do not show the rate of constraint violation, i.e., whether the sampling distribution satisfies the constraints. This leads to the following point.
- In Theorem 3.3, when there are equality constraints, the optimal distribution is supported on a low-dimensional manifold while the sampling distribution is not, so the KL divergence is not well defined. Hence I doubt the correctness of Theorem 3.3. I suggest the authors consider a metric that does not depend on the density ratio, such as the W2 distance or the TV distance.
- In the numerical experiments, the paper lacks comparisons with other methods, e.g., mirror Langevin-type methods for sampling on a convex set, and the PD Langevin and Control Langevin of [1] on rate-constrained Bayesian models.
[1] Liu, Xingchao, Xin Tong, and Qiang Liu. "Sampling with trustworthy constraints: A variational gradient framework." Advances in Neural Information Processing Systems 34 (2021): 23557-23568.
Questions
Please see the weaknesses part.
Limitations
The limitations are pointed out in the paper.
We thank the reviewer for their positive comments on our results, particularly our theoretical guarantees. As the reviewer correctly notes, in contrast to [22] ([1] in the reviewer's comments), our analysis handles two levels of approximation: time- and space-discretization. We next address their questions.
Q1 Indeed, as we clearly acknowledge in the paper, Algorithms 1 and 2 can be seen as approximations of the PD Langevin from [22]. But there are substantial computational and theoretical differences between these methods.
The dual variable updates in the algorithms from [22] require computing an expectation over the current particle distribution [see (10)], which is intractable. While we agree that it can be approximated by running many particles in parallel, this would result in a substantially higher computational cost than the stochastic updates in our paper. Indeed, we show PD-LMC (Algorithm 1) converges using a single-particle approximation.
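To make this computational contrast concrete, here is a hedged sketch in our own toy notation (not the paper's code): the N-particle approximation of the dual-gradient expectation required by [22], versus the single-sample estimate that PD-LMC uses. The constraint function `g` is a made-up example.

```python
import numpy as np

# Hypothetical moment-constraint function; any statistical constraint works.
g = lambda x: x**2 - 1.0             # e.g., a variance-style constraint E[g(x)] <= 0

# [22]-style dual gradient: an expectation over the current distribution,
# approximated with N particles -> O(N) work per dual step.
def dual_grad_particles(particles):
    return np.mean(g(particles))

# PD-LMC dual gradient: a single sample -> O(1) work per dual step,
# an unbiased (if noisy) estimate of the same quantity.
def dual_grad_single(x):
    return g(x)

rng = np.random.default_rng(1)
cloud = rng.standard_normal(10_000)
print(dual_grad_particles(cloud))    # close to E[g(x)] = 0 under N(0, 1)
print(dual_grad_single(cloud[0]))    # noisy single-sample estimate
```

The point of the sketch is only the per-iteration cost: both estimators target the same expectation, but the single-sample version is what makes the stochastic dual update in Algorithm 1 cheap.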
Furthermore, it is not clear from [22] how these approximation errors impact the convergence of the algorithms since it only provides guarantees for exact dual updates (computing expectations). In contrast, our theoretical analyses exactly characterize the bias introduced by approximation errors (Proposition D.1) and show that they do not affect the convergence in the convex setting (Theorem 3.3). This is confirmed by our experiments. These results hold for discrete-time, expectation-free algorithms.
That being said, we acknowledge that we could have better illustrated the relative performance of PD Langevin and our stochastic version PD-LMC. We have addressed this oversight and provided an illustration of the performance of both schemes as the number of LMC particles grows for the one-dimensional truncated Gaussian and the rate-constrained Bayesian model (see Fig. 1 and 2 in the pdf attached to the global response). This hopefully illustrates that one can achieve essentially the same sampling performance as PD Langevin at a considerably lower computational cost.
Q2 The reviewer has a point: our results do not directly analyze the decrease rate of the constraint violation. Yet, our theoretical guarantees show that PD-LMC converges to μ*, which is feasible by definition. Hence, Theorems 3.3 and 3.5 do address the issue of feasibility. Additionally, it is possible to use our results to explicitly control the constraint violation along iterations.
Consider the result in Theorem 3.3, namely (13). Since the averaged KL divergence KL(μ̄_K‖μ*) is controlled by (13) and E_{x∼μ*}[g(x)] ≤ 0, we obtain a bound on the constraint slacks E_{x∼μ̄_K}[g(x)], where we used the ℓ1-norm relation between the slack vector and its entries.
If g is bounded by 1, the difference of expectations is bounded by the total variation distance between μ̄_K and μ*. Using Pinsker's inequality, we therefore obtain E_{x∼μ̄_K}[g(x)] ≤ √(2 KL(μ̄_K‖μ*)). If g is instead L-Lipschitz (and the potential is m-strongly convex), we have the transport inequality W2(μ̄_K, μ*) ≤ √(2 KL(μ̄_K‖μ*)/m), which implies E_{x∼μ̄_K}[g(x)] ≤ L √(2 KL(μ̄_K‖μ*)/m). The same arguments hold for the equality constraints h.
Hence, our results can be used to control (in expectation) the constraint violation of ergodic averages along the PD-LMC trajectories. These behaviors are also observed in our experiments. For instance, Fig. 3 of the pdf attached to the global response shows the ergodic constraint slacks for the equality constraints of the Bayesian stock market problem (Fig. 2 in the paper). Notice that by the end of the simulation their values are within a small tolerance of zero. We thank the reviewer for raising this point and will include these corollaries and discussions in the camera-ready version.
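As a quick numerical sanity check of the Pinsker-type step above (a toy example of our own, not from the paper): for two one-dimensional Gaussians and a constraint function bounded by 1, the gap in expectations is indeed dominated by √(2·KL).

```python
import numpy as np

# Verify |E_mu[g] - E_pi[g]| <= ||g||_inf * sqrt(2 KL(mu||pi)) numerically
# for mu = N(m, 1) and pi = N(0, 1), with g bounded by 1 (Pinsker's inequality).
xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
pdf = lambda m: np.exp(-(xs - m) ** 2 / 2) / np.sqrt(2 * np.pi)

m = 0.5
mu, pi = pdf(m), pdf(0.0)
g = np.tanh(xs)                      # bounded constraint function, ||g||_inf <= 1

gap = abs(np.sum(g * mu) * dx - np.sum(g * pi) * dx)
kl = m**2 / 2                        # KL(N(m,1) || N(0,1)) in closed form
bound = np.sqrt(2 * kl)
print(gap <= bound)                  # True: the gap is dominated by sqrt(2 KL)
```

This is only an illustration of the inequality used to convert the KL guarantee of Theorem 3.3 into a constraint-violation bound; the constants in the actual argument depend on the specific g.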
Q3 We respectfully, but strongly, disagree with this statement. We believe there may have been a misunderstanding arising from the fact that the constraints in (PI) are distribution constraints rather than constraints on the support of the distribution. We refer the reviewer to Section 2.2 for a detailed discussion of the differences between these constraint types. In our case, imposing a moment constraint (even as an equality constraint) does not result in μ* being supported on a lower-dimensional manifold. Another way to see this is by noticing that μ* coincides with the tilted distribution induced by the optimal multipliers [see Prop. 2.2(iv)]. Hence, for a full-dimensional π and finite multipliers [which is the case, see Prop. 2.2(ii)], μ* will also be full-dimensional. For a concrete example, we refer the reviewer to the response to Q2 of reviewer 4nfu, where we derive an explicit solution for (DI) under an equality constraint.
Q4 We have included results for the Mirrored Langevin in the two-dimensional truncated Gaussian as well as the PD Langevin from [22] for the rate-constrained Bayesian problem (Figs. 4 and 2 respectively in the pdf attached to the global response). We note again that the latter requires an explicit integration that is intractable and that using LMC chains to approximate it reduces to the settings of Algorithms 1-2, whose convergence is guaranteed by our results (Theorems 3.3 and 3.5). As expected, we therefore obtain the same results albeit at a considerable increase in computational complexity.
Thanks for the feedback. Most of my concerns are addressed and I'm happy to raise the score. The rate of constraint violation is an important proposition that should be added to the main text. I would like to mention that the proposed PD-LMC algorithm is the same as PD Langevin in [22] as the particle number goes to infinity, while the theoretical analysis in this paper is solid and novel.
The paper studies constrained sampling schemes. The objective function is the KL divergence to a target distribution, while the constraint set is given by expectation equalities or inequalities. The authors rewrite the problem in a saddle-point formulation and study the Wasserstein gradient descent and L2 gradient ascent directions. This results in a particular Langevin dynamics with a dual ascent direction. The authors provide a convergence analysis of the proposed algorithm. Several numerical examples demonstrate the effectiveness of the proposed method.
Strengths
The authors clearly explain the constrained optimization problems. They apply the primal-dual gradient descent-ascent algorithm based on the Wasserstein space to solve the optimization problem. Convergence analysis is presented using tools in optimal transport.
Sampling from a convex set is a very good application.
Weaknesses
There is a lack of discussion of the literature on generalized optimization problems in Wasserstein spaces, e.g.:
Wang et al. Accelerated Information Gradient flow. Journal of Scientific Computing. 2022.
Wang et al. Information Newton's flow: second-order optimization method in probability space, 2020.
Tan et al. Noise-Free Sampling Algorithms via Regularized Wasserstein Proximals, 2023.
Questions
- For the time update, the authors perform a forward Euler step for the primal and dual variables. Do the authors expect advantages if one applies proximal steps instead? What is the main difference between analyzing the primal-dual algorithm in Wasserstein space and in the classical Euclidean space?
- Can the authors provide some simple examples, such as Gaussian target distributions and linear moment constraints, to explain the main result? This could partially answer the first question.
Limitations
There are no limitations.
We thank the reviewer for their enthusiastic opinion of our paper. We next address their questions one-by-one.
Weakness The reviewer has a point. We focused on regular Langevin dynamics and its literature, but this paper is indeed part of the broad line of work on sampling as optimization in the space of probability measures. It is therefore connected (or could be extended) to many other schemes. We appreciate the suggested references and will include them in the manuscript.
Q1.1 Since our main focus was on deriving and studying a computationally-friendly algorithm for the constrained sampling problem (PI), we did not consider proximal updates that are more expensive than the fully explicit Euler updates used in this paper (for both primal and dual variables). However, we acknowledge that investigating proximal steps is an interesting future direction and we believe that it could in fact lead to stronger theoretical guarantees (under additional assumptions) as is the case for gradient descent-ascent in Euclidean space (see, e.g., [40]). We will make note of this point in the revised version of the paper.
Q1.2 To a large extent, our approach is akin to that used to study any optimization algorithm/dynamical system: we construct a Lyapunov function to bound the duality gap [note from the saddle-point relation (8) that this is the right optimality measure]. The techniques used to bound this gap in Wasserstein space, however, are different.
Indeed, while the dual (DI) remains a (finite dimensional, non-smooth) optimization problem in Euclidean space, the primal (PI) is an (infinite dimensional, smooth) optimization problem in Wasserstein space. The proof must therefore account for the mixed nature of the Lagrangian. Additionally, Algorithm 1 performs stochastic updates in both the primal (step 3) and dual (steps 4-5). Hence, the potential used to update the particle in step 3 is a random variable, and we have to deal with the joint distribution of the primal and dual iterates. This is an important distinction from the traditional LMC, or from when the dual updates (steps 4-5) are performed using exact expectations as in [22]. To address these issues, we instead use conditional laws to obtain our guarantees in expectation over realizations of the dual variables.
Q2 Consider a standard Gaussian target, i.e., π(x) ∝ exp(−‖x‖²/2), and the linear moment constraint E_{x∼μ}[x] = b for some b ∈ R^d. This can be cast as (PI) with h(x) = x − b (no inequality constraints, i.e., g ≡ 0). Clearly, the solution of (PI) in this case is μ* = N(b, I). What Prop 2.2 claims is that rather than directly solving (PI), we can solve (DI) to obtain a Lagrange multiplier ν* such that μ* = μ_{ν*} for μ_ν defined as in (5) (line 134).
We can show this is indeed the case here by doing the computations explicitly. Indeed, we have
μ_ν(x) ∝ exp(−‖x‖²/2 − νᵀ(x − b)).
Completing the square, we obtain
μ_ν = N(−ν, I).
From the definition of the dual function in (7) (line 139), we can write (DI) explicitly as a ratio of normalizing factors, namely
maximize_ν d(ν) = −log(Z(ν)/Z(0)) = −νᵀb − ‖ν‖²/2.
Immediately, we obtain
ν* = −b.
Note that, as we mention in the text, the dual problem is indeed a concave program. To conclude, observe that μ_{ν*} = N(b, I) = μ*, as we indeed have E_{x∼μ_{ν*}}[x] = b.
Our main results (Theorems 3.3 and 3.5) show that it is not necessary to first determine the Lagrange multipliers in order to sample from μ*. Indeed, the stochastic primal-dual Langevin methods in Algorithms 1--2 simultaneously determine ν* and sample from μ*. Additionally, they do so without explicitly evaluating any expectations.
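To complement this warm-up, here is a hedged numerical sketch of single-particle primal-dual Langevin on the same example, written in our own notation with ad hoc step sizes: a standard Gaussian target with the equality constraint E[x] = b. With our sign conventions, the ergodic average of the samples should approach b and the averaged dual iterate should approach −b.

```python
import numpy as np

# Sketch of a single-particle primal-dual LMC loop (in the spirit of
# Algorithm 1) for pi = N(0, I) with the constraint E[x] = b.
# Step sizes, iteration counts, and variable names are our own choices.
rng = np.random.default_rng(0)
d = 2
b = np.array([1.0, -0.5])

gamma, eta = 0.05, 0.01            # primal (LMC) and dual step sizes
x = np.zeros(d)                    # single particle
nu = np.zeros(d)                   # dual variable for E[x] = b

xs, nus = [], []
for k in range(60_000):
    # Primal step: one LMC update on the tilted potential
    # U(x) + nu^T (x - b), with U(x) = ||x||^2 / 2.
    grad = x + nu
    x = x - gamma * grad + np.sqrt(2 * gamma) * rng.standard_normal(d)
    # Dual step: stochastic ascent using the single sample, no expectations.
    nu = nu + eta * (x - b)
    xs.append(x.copy())
    nus.append(nu.copy())

xs, nus = np.array(xs[10_000:]), np.array(nus[10_000:])
print(xs.mean(axis=0))             # ergodic mean approaches b
print(nus.mean(axis=0))            # averaged dual iterate approaches -b
```

The dual variable fluctuates around its optimum while the particle samples from the corresponding tilted Gaussian, so the multipliers are determined and the samples are produced simultaneously, as in the discussion above.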
Please do not hesitate to let us know if something is not clear. If this explanation indeed clarifies some of our duality derivations, we will consider including it in the appendix as a warm-up example.
The authors carefully address my questions. I also like the example of Gaussian distributions. I suggest the publication of this paper.
In this work, the focus is on a constrained optimisation problem in the space of measures. Specifically, the goal is to obtain a distribution / samples from a distribution which is close in KL to a target distribution while also satisfying a set of statistical constraints. The paper discusses this somewhat atypical problem, and designs the PD-LMC method to generate approximate samples from such a distribution. Theoretical properties of PD-LMC are also presented, along with some numerical experiments to showcase the workings of this method.
Strengths
Quality and Clarity
The paper reads well, with a clear description of the objective at hand and the necessary background to help the reader grasp the problem and the proposed solution. For someone used to the classical constrained sampling problem, the examples in Section 2.2 definitely helped contextualize the problem better, and are appreciated. The dual formulation which is central to this work is also adequately presented, which helps in understanding the algorithm. It is interesting that the proposed method gets away without computing expectations, replacing them with a single-particle estimate.
Originality and Significance
The idea is novel, and looks like a clean way to adapt ideas from Euclidean optimization for the distributional constrained problem considered here. I cannot quite comment on the significance of this proposed method as I'm not familiar with this flavor of constrained sampling. Judging by the numerical experiments in Section 4, the method appears to work well in low-dimensional settings.
Weaknesses
The method is interesting and novel, but there is some loss of intuition for me, which I've tried to express in Q2. Essentially, the proposed method performs two levels of approximation which, while practically appealing, make it hard to follow. I also have a minor issue with the laws of the iterates and the expectations appearing in Theorem 3.3. Essentially, the concern is that there is no stochasticity on the LHS, and the interplay between laws and samples here is confusing to me.
Questions
- Is the purpose of the μ-updates simply to approximate the discretisation in steps (11b) and (11c)? This was particularly confusing to me because problem (DI) doesn't involve any μ, which is what is sought out to be solved.
- Following up on the previous question, to understand better, the line of reasoning is: Eq. 9 would be what is ideal, but this requires knowing the tilted distribution and expectations under it. So, [22] proposed Eq. 10 in the equality-constrained setting, which is extended to Alg. 1 in this paper. Since the tilted distribution is a tilted version of the target, what would a proposal like approximately computing the expectations in Eq. 9 using a high-accuracy sampler like MALA do (notwithstanding the sample waste)? I suppose something similar is being done in Algorithm 2?
- What is preventing the dual variables from blowing up in magnitude? In other words, how large can they be / at what rate do they grow?
- Why are equality constraints disregarded for PD-LMC with LSI potentials? Since the equality multiplier doesn't exist in this setting, why does it show up in Algorithm 2?
- Can you give a concrete example for when Assumption 3.4 is satisfied, i.e., with the constraint term being a bounded perturbation, together with a bound on the log-Sobolev constant?
Limitations
Not quite, it would be nice if the authors could comment on some limitations of their analysis / methods.
We thank the reviewer for their positive comments on our paper.
Weakness The reviewer has a point that to obtain a practical algorithm we perform stochastic updates in both the primal (step 3) and dual (steps 4-5), which complicates things. This in fact marks an important distinction from the analyses of the traditional LMC, or from when the dual updates are performed using exact expectations as in [22], in that we now have to handle the full joint distribution of the primal and dual iterates. To overcome this challenge, we work with conditional laws to obtain "in expectation" guarantees. That is why the bounds in Theorem 3.3 are not stochastic.
Indeed, note that (13) comes from an inequality on the conditional law of the iterates given the dual variables (line 617, Appendix C). Our proof analyzes the evolution of this conditional law [as in (18)] and then applies Jensen's inequality, using the fact that averaging the conditional law over the dual variables yields the law of the iterates (line 618). Due to space constraints, we did not include this technical discussion in the main text, but we plan to do so in the final version.
Q1 Note that (DI) involves μ through its objective, the dual function from (7). But the reviewer is correct that, in contrast to (PI), which directly seeks μ*, the dual problem (DI) seeks the multipliers that can then be used to obtain μ* via (5)--(6) (line 134). In fact, finding the optimal multipliers and finding μ* are equivalent [Prop. 2.2(iv)]. Hence, the goal of the μ-update is in fact twofold: sampling from the current tilted distribution and computing the stochastic dual gradients [using the discrete versions of (11b)--(11c) in steps 4-5 of Algorithm 1].
Q2 The reviewer is correct. Computationally, (9) is intractable; (10), proposed by [22], is better but still intractable (due to the expectations); Algorithm 1, however, is practical and can be implemented without approximations. From the point of view of convergence, (9) performs "dual ascent" and would converge to the optimal multipliers; the continuous-time counterpart of (10) was shown to converge in [22] (we extend these results in Theorem 3.3 by showing the discrete version itself converges under milder conditions, as we discuss in line 247); and we prove that Algorithm 1 also converges (again, under milder assumptions).
The reviewer's proposal would indeed yield Algorithm 2 (with an extra Metropolis acceptance step). But while this approach has advantages in the non-convex case (Section 3.2), its benefits do not necessarily outweigh the increased computational cost. For instance, consider the sample mean estimate of the one-dimensional truncated Gaussian from Fig. 1 (main paper) as a function of the number of LMC iterations (step 3) taken in Algorithm 1 between dual updates (see Fig. 1 in the pdf attached to the global response). I.e., with a single LMC iteration per dual update we have Algorithm 1, while with many iterations we follow the reviewer's proposal. Note that the additional computational cost does not improve convergence in this case (especially when considering the number of "LMC steps," which is proportional to the computational complexity).
Q3 This is hard to guarantee in general, even in the Euclidean convex case (see, e.g., [35]). As we mention in the paper (line 235), a common solution is to clip the iterates in Algorithm 1 using an upper bound on the optimal dual variables. Such bounds are well-known and can be obtained under a variety of constraint qualifications such as Assumption 2.1, although explicit bounds for equality constraints require additional assumptions (see, e.g., [26,35] or the more general [Gauvin, "A necessary and sufficient regularity condition to have bounded multipliers in nonconvex programming," 1977]). Alternatively, a decreasing "weight decay" regularization can be used (see, e.g., [62]). In some cases, it is even possible to directly bound the iterates (see Lemma D.3). For these reasons, we chose to present our results in the form of Theorem 3.3, which we find to be more applicable, and then comment on potential solutions after the theorem. We will expand our remark on this point in the camera-ready version.
Q4 The reviewer is correct: there is a typo in step 4 of Algorithm 2. We omit the equality constraints in the non-convex case to keep the analysis manageable. Deriving results similar to Lemma D.3 and D.4 for equality constraints requires additional assumptions (see response to Q3) and substantial derivations that we felt obscured the results. We note that it is sometimes possible to cast equality constraints as two tight inequalities, although this might lead to numerical issues. We will expand on this remark in the camera-ready version.
Q5 Let f be strongly convex and g be bounded. The tilted potential is then a bounded perturbation of a strongly convex potential. For zero multipliers, Assumption 3.4 holds with the log-Sobolev constant of π itself. As the multipliers grow, the log-Sobolev constant degrades, but it remains positive for all finite multipliers (see, e.g., [42, Prop. 5.1.6] or Theorem 1.1 in [Cattiaux and Guillin, "Functional inequalities for perturbed measures with applications to log-concave measures and to some Bayesian problems," 2022]). Since the optimal multiplier is finite (see the bound in Lemma D.3), we could explicitly clip the iterates and get an a priori bound on the log-Sobolev constant. But even without any explicit limit, we show in Theorem 3.5 that the dual iterates remain bounded throughout.
Q3: I feel like this is an important point to cover. Specifically, if the dual iterates grow too quickly, the analysis would yield a vacuous bound. In your method, 1) how would you obtain a bound, 2) how would you enforce a bound on these variables, and 3) how would this affect the guarantees for the method?
I have no further questions aside from the ones above.
We agree with the reviewer that these are important points and address them in the discussion after Theorem 3.3 in the current version of the manuscript (lines 235-244). There, we argue for the form of Theorem 3.3 rather than modifying Algorithm 1 because Theorem 3.3 also shows there exists a sequence of step sizes such that the dual iterates are bounded. Additionally, none of our experiments enforce bounds (they implement PD-LMC exactly as described in Algorithm 1). Yet, we always observe the dual variables converging (see, e.g., Fig. 2 in the manuscript), suggesting that pathological cases are not common. That being said, the theory would not be complete without addressing these cases, which we discuss in the remarks of lines 235-244. We will expand our treatment of these points, including the discussion below, in the revised version of the manuscript. If any point remains unclear, we are happy to provide further details.
To the reviewer's points:
- Bounds on the multipliers: Under Assumption 2.1, we provide such a bound on λ* in Lemma D.3. Explicitly, suppose there exists a strictly feasible μ† such that E_{x∼μ†}[g(x)] ≤ −ξ·1 for some ξ > 0, E_{x∼μ†}[h(x)] = 0, and KL(μ†‖π) < ∞ (Assumption 2.1). Then, by definition of the dual function in (7),
d(λ, ν) ≤ KL(μ†‖π) + λᵀ E_{x∼μ†}[g(x)] ≤ KL(μ†‖π) − ξ‖λ‖₁.
By strong duality, d(λ*, ν*) equals the optimal value P* ≥ 0 of (PI), which yields
‖λ*‖₁ ≤ (KL(μ†‖π) − P*)/ξ ≤ KL(μ†‖π)/ξ.
A bound on ν*, though more complicated, can be obtained under a similar constraint qualification (see, e.g., the more general [Gauvin, "A necessary and sufficient regularity condition to have bounded multipliers in nonconvex programming," 1977]).
- Enforcing the bound: Suppose ‖λ*‖ ≤ B. It suffices to replace steps 4 and 5 in Algorithm 1 by the projected updates
λ_{k+1} = Π_Λ[λ_k + η g(x_{k+1})] and ν_{k+1} = Π_N[ν_k + η h(x_{k+1})],
where Λ = {λ ≥ 0 : ‖λ‖ ≤ B} (and similarly for the set N). This is a common solution used in the Euclidean case (see, e.g., [35]).
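For concreteness, here is a hedged sketch (our own helper names, with B a hypothetical a priori bound on the optimal multipliers) of such a projected dual update. For the set {λ ≥ 0, ‖λ‖₂ ≤ B}, the Euclidean projection can be computed by first clipping to the nonnegative orthant (a cone) and then radially rescaling onto the ball.

```python
import numpy as np

def project(lam, B):
    """Euclidean projection onto {lambda >= 0, ||lambda||_2 <= B}.

    Since the orthant is a cone and the ball is centered at the origin,
    clipping then rescaling gives the exact projection onto the intersection.
    """
    lam = np.maximum(lam, 0.0)
    norm = np.linalg.norm(lam)
    return lam if norm <= B else lam * (B / norm)

def dual_step(lam, g_sample, eta, B):
    # Projected version of the stochastic dual ascent step:
    # lambda_{k+1} = Proj[ lambda_k + eta * g(x_{k+1}) ].
    return project(lam + eta * g_sample, B)

print(project(np.array([3.0, -1.0, 4.0]), B=5.0))  # [3. 0. 4.]
```

The same pattern applies to the equality multipliers, with the ball projection alone (no orthant clipping).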
- How does this affect the guarantees? It does not. Indeed, note that the proof of Theorem 3.3 is based on the fact that (17) is a Lyapunov function. This is proved in Lemma C.1. When bounding the Euclidean norm of the dual variables (line 642), the first step is realizing that the projections onto Λ and N are non-expansive and that λ* ∈ Λ and ν* ∈ N. Hence, we can use
‖Π_Λ[λ] − λ*‖ ≤ ‖λ − λ*‖
to obtain (26). The proof then proceeds as is and yields the exact same results as in Theorem 3.3, except that (13) [and similarly (14)] now holds with the explicit bound B on the dual variables.
I have no further questions, and would like to maintain my score. I think that parts 2 and 3 from above would benefit the exposition.
We thank the editors and reviewers for their time and for the positive comments on our paper.
In the sequel, we respond to each of their questions individually. Throughout our responses, we refer to references and equations as numbered in the submitted version of the manuscript. We also refer to figures contained in the pdf attached to this response as "pdf attached to the global response."
Please comment on the relation of your work to ...
Salim and Richtarik, Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm, NeurIPS 2020, https://papers.nips.cc/paper/2020/hash/2779fda014fbadb761f67dd708c1325e-Abstract.html
This seems relevant due to the primal-dual interpretation of Langevin described there. Moreover, that work directly addresses the issue of constrained sampling. It is not cited, but seems very relevant. Is it not? In either case, I believe this should be clearly explained here to the reviewers and in the paper.
Apologies for posting this so late; I hope you can still respond.
Reviewers: What do you think?
Best regards,
AC
I feel that the constrained setting in the submission deviates from the traditional constrained sampling setup in that the constraints are expressed as functionals, and the dual formulation extends the Euclidean formulation to the space of probability measures. The above paper appears more relevant when the potential to sample from is a composite function (with smooth and non-smooth parts). But I'm also interested to hear the authors' thoughts (hopefully they can respond).
This manuscript and [Salim and Richtarik] use primal-dual methods in distinct ways. As reviewer d9K2 points out, they tackle the problem of sampling from composite potentials that can contain a non-smooth part restricting the support of the distribution. In contrast, we study (PI) which imposes statistical (moment) constraints on the distribution (see Table 1 and Section 2.2 for a detailed discussion on the differences between these constraint types). While (PI) can be used to tackle support constraints (as we show in Section 2.2 and illustrate in our experiments), it can also handle statistical restrictions for which proximal operators cannot be constructed (see examples in Section 2.2). Though our focus on computational aspects means we did not consider proximal updates in this paper, we do believe that studying proximal versions of PD-LMC is an interesting future direction (see more details in our response to reviewer 4nfu).
Additionally, [Salim and Richtarik] consider stochastic "primal" updates [see (3) in their paper], as in stochastic gradient Langevin [33]. In contrast, PD-LMC (Algorithm 1) uses stochastic gradients only for its dual updates (steps 4-5). Indeed, note from (6) that the potential driving the primal update is a deterministic function that depends only on the parameters of the sampling problem. Hence, its evaluation does not involve computing any expectation. Combining stochastic gradient Langevin with PD-LMC is also an interesting research direction, but it is outside the scope of this work (see comment in line 198).
That being said, as with other papers mentioned in these reviews, [Salim and Richtarik] is certainly relevant related literature and we will definitely include these clarifications in the revised version of the manuscript.
The focus of this paper is on a constrained optimization problem in the space of measures. The goal is to obtain samples from a distribution which is close in KL to a target distribution while also satisfying a set of statistical constraints. The paper discusses this somewhat atypical problem, and the authors design the PD-LMC method to generate approximate samples. Theoretical properties of PD-LMC are studied. Numerical experiments showcase the workings of this method.
All reviewers were positive about this work (scores 5, 5, and 7), indicating that the strengths outweigh the weaknesses. Some of the strengths of the paper include:
- The paper is well written and reads well
- The objective of the research and the background help the reader grasp the problem and the proposed solution
- The dual formulation central to this work is also adequately presented
- The idea seems to be novel, and looks like a clean way to adapt ideas from Euclidean optimization for the distributional constrained problem considered here
- The authors establish solid convergence guarantees for their algorithms in various settings
I fully expect the authors to address all the observed weaknesses in the camera-ready version of the paper. On this condition, and based on the above reasons, I propose that the paper be accepted.
AC