Ito Diffusion Approximation of Universal Ito Chains for Sampling, Optimization and Boosting
Approximating General Markov Chains by Diffusion Processes without assuming Gaussian noise.
摘要
评审与讨论
This work considers a rather general and broad class of Markov chains, Ito chains that look like Euler-Maryama discretization of some Stochastic Differential Equation. However, I am not fully convinced of the importance of this work.
优点
This paper studies a rather general and broad class of Markov chains and studies the convergence rates.
缺点
- The theoretical results do not provide any new insight to the community.
- The analysis is based on standard techniques.
- The paper lacks a empirical study to verify the theoretical findings.
问题
None
伦理问题详情
None
Thank you for review. Let us address the concerns that you raised:
The theoretical results do not provide any new insight to the community.
Prior to our knowledge, our work is the first one to establish the following insights:
- Convergence in Wasserstein-2 of SGD diffusion under the weak assumption that -CLT holds, which in particular, satisfied if the noise satisfies , which in prior works appeared only in (Li et al., 2019) [38] (last row in Table 2), however in that work convergence was proved in the weak sense and assumed dissipativity of the drift, which is a restriction, that does not allow to apply to such to problems as a stochastic approximation or saddle-point optimization.
- We are the first to establish non-asymptotic convergence for the Stochastic Gradient Langevin Boosting algorithm, developed in [53]. All prior results around SGLD do not apply to that case, as they all assume the normality of the noise, which is not satisfied in that case. In the original work, authors were able to obtain only asymptotic first-order weak convergence results based on the almost a century-old work of Kushner J., "On the Weak Convergence of Interpolated Markov Chains to a Diffusion."
The analysis is based on standard techniques.
- All prior works on the convergence of Markov Chains to Diffusions assume at least distant dissipativity of the drift (that drift points to the origin outside of a ball), which simplifies their analysis, and most of the works assume normality of the noise or close to normal distribution, or establish convergence in a weaker than Wasserstein-2 sense. Techniques developed in our work, namely window coupling (see Eq. 4) and coupling via noise (see Eq. 3), are novel and non-standard. They both aim to **circumvent the lack of such assumptions on which the prior works were based. **
- Regarding standard results like the Girsanov theorem, it does not apply to our non-Markov process, which appears in our analysis due to our novel coupling approaches that procured a non-Markov process for which the Novikov condition fails. Thus, in Theorem 1, we showed how such a non-markovian process can be reduced to the standard via a non-standard chain of arguments developed in Section B.3., based on Lemma 10. Again, this is not only a non-standard technique but provides novel fundamental insights about the Girsanov theorem.
- As for the SGD diffusion case, our Lemma 4 and techniques used to prove it are crucial for establishing Wasserstein-2 convergence of the SGD diffusion, which is another novel development of our work, which is game-changing results for showing SGD diffusion bounds for Wasserstein-2, without which the final bound of Theorem 2, would've been trivial and no convergence established. Which is, again, new insight and new technique.
The paper lacks a empirical study to verify the theoretical findings.
Our work is purely theoretical and establishes explicit convergence bounds for many algorithms known for practicians to exhibit diffusion-like behavior, like SGLB, SGD, and any stochastic optimization. We merely unified them under one analysis done at the weakest possible assumptions. Such unification produced novel, previously unknown/sharper bounds of their convergence in Wasserstein-2 for algorithms like SGD and SGLB.
We are confused about what type of empirical study the Reviewer wants to see, as the Wasserstein-2 distance is non-constructive by its definition (see Section A), while it is of undoubtful importance for the field.
This paper studies a broad class of Markov chains called Ito chains, which allows for almost arbitrary isotropic and state-dependent noise instead of normal and state-independent one. The paper analyzes the coupling between Ito chains and its anagolous SDEs, and provides an upper bound for distance between laws of the two. Their results cover many of the known estimates. In particular, some techniques such as Window Coupling and an augmented version of Girsanov's Theorem are of independent interest.
优点
Originality: The originality is high. The authors study a quite general class of Markov chains and analyze the upper bound between SDE, which is also a popular research area. Clarity: the clarity is good. The paper sequentially introduces several auxiliary SDEs/Markov chains to eventually bridge the gap between the processes of interest: and . The motivation of each proposed auxiliary processes are clearly stated and is easy to follow for readers. Significance: The theoretical contribution is quite good for this area. The authors consider non-Gaussian and state-dependent chains which is not so deeply investigated in literature. The window coupling and "modified" Girsanov theorem may be of independent interest.
缺点
There are some mistakes and typos in proofs:
- In Appendix, the equations in the middle of page.18 (the one after (10)), a noise matrix before is missing. The same issue appears in most of the later equations in this proof.
- In the same equation, the last two terms have a wrong scaling w.r.t. . Specifically, they should be and . Consequently, some following terms should be multiplied by or so. (It seems not to affect the final upper bound, though. It is better to still check the whole roadmap of proofs.)
- I wonder how the last term in the top equation on page.19 is derived (). It seems merely using Assumption 1 (4) is not enough. Or maybe there should be some additional assumptions?
问题
The window coupling seems to be among the crucial parts in this paper. Could authors explain if this is first proposed by this paper, or already appears in literature? It is best to provide some references.
The paper gives better bounds for SGD with Gaussian/non-Gaussian cases [1]. Could authors compare the theoretical techniques used in two papers and explain where do such improvements probably come from?
We are grateful to the Review for the positive review and fair concerns. Let us answer the weaknesses first.
In Appendix, the equations in the middle of page.18 (the one after (10)), a noise matrix .. is missing. ...
In the same equation, the last two terms have a wrong scaling w.r.t. ...
Thank you for noticing that typos! As you mentioned, it does not affect the final bound as we are reducing noise to a form passed into the -CLT assumption. We fixed both typos in the updated revision by keeping outside of summation, which also improved clarity.
I wonder how the last term in the top equation on page.19 is derived. It seems merely using Assumption 1 (4) is not enough. Or maybe there should be some additional assumptions?
Assumption 1 (4) and (5) are enough! Seems like during pre-submission revisioning we lost a chain of arguments, where we show how it follows. We updated the revision with the missing argument. The argument is following: First we note the following algebraic identities:
They hold due to assumptions of , and . Then it follows:
And on questions:
The paper gives better bounds for SGD with Gaussian/non-Gaussian cases [1]. Could authors compare the theoretical techniques used in two papers and explain where do such improvements probably come from?
If we understood correctly, by [1], you mean Alfonsi A. et al., "Optimal transport bounds between the time-marginals of a multidimensional diffusion and its Euler scheme.". To answer this, we can refer to Remark 3.2 in [1], where the authors point out that the bound can be improved if uniform ellipticity is assumed (which we explicitly stated in Table 2 and our Assumption 1). We also note that the work [1] considered a general SDE case and might not cover the SGD case. As for the SGD case, we compare it against [15] and note that the work [15] considers the Wasserstein-1 distance. We believe that improvement in our bound comes mainly from using entropic bounds instead of the Lyapunov-type of analysis, which became possible due to Lemma 4.
The window coupling seems to be among the crucial parts in this paper. Could authors explain if this is first proposed by this paper, or already appears in literature? It is best to provide some references.
Best to our knowledge, we have not seen it in any prior works, however, some intuitive kind of arguments similar to that are common in prior literature (e.g., Latz J., "Analysis of stochastic gradient descent in continuous time"). As for the term , since all prior works assumed dissipativity/convexity, it was not needed for them.
This paper considers a general class of Markov chains called Ito chains that cover a wide range of applications including sampling, optimization, and boosting. The paper provides bounds for the approximation error in W_2 distance between such Ito chains and the corresponding stochastic differential equation (which can then be used to study the Markov chain). In several applications, the bounds improve upon previous results or are completely new.
优点
The results in this paper provide novel and general bounds on the approximation error for Ito chains using the corresponding stochastic differential equation. The results cover a broad range of applications in sampling, optimization, and boosting (for example, SGLD, SGD, and Stochastic Gradient Boosting) and are novel in several such applications (for example, the results cover almost arbitrary isotropic and state-dependent noise). The proof involves several new ideas and the presentation is quite clear.
缺点
It would be nice if the author(s) could comment on whether Assumption 1 in Section 2 is typically satisfied/easy to verify in applications.
问题
It would be nice if the author(s) could comment on whether Assumption 1 in Section 2 is typically satisfied/easy to verify in applications.
We are grateful for your positive review and high score!
Let us answer your question/weakness.
It would be nice if the author(s) could comment on whether Assumption 1 in Section 2 is typically satisfied/easy to verify in applications.
When formulating assumptions, we attempted to cover as much generality as possible to cover/generalize the majority of prior works.
As for checking them, it is helpful to simplify some of them first (with maybe loss of generality), for example:
- -CLT will be satisfied if
- For uniform ellipticity, if we are working with SGD with dissipative/convex loss (e.g., regularized by norm), then it is possible to show that iterations remain in a ball, which, even if the covariance is not bounded by default, can be considered as bounded without loss of generality, which is a standard trick in SGD literature.
- Assumption like will be satisfied if both are Lipshitz-continuous, which in the case of SGD/SGLD is satisfied automatically if the loss (which gradient we are taking as drift are Lipshitz smooth)
- Assumptions on bias/covariance shifts in such cases as smoothed SGDL or SGD will be satisfied automatically if the loss function (which gradient we are taking as drift ) is Lipshitz-smooth, which is quite a standard assumption.
- Generally, checking the assumptions will require proving they are held on a case-by-case basis. However, in most cases, there is something like Lipshitz-smoothness of the loss / Lipshitz continuity of the drift, which makes them hold by default.
Thank you for the detailed response and clarification!
Authors study chains that look like Euler-Maryama discretization of Ito diffusions. The method can handle a wide range of diffusions with state-dependent noise as well as inexact drift and diffusion coefficient, which does include SGLD etc. The papers main contribution is a theoretical control over the distance between the chain and the underlying diffusion. Although similar results already exist in the literature, their analysis has novelty in certain settings.
This paper was reviewed by 3 reviewers and received the following Rating/Confidence scores: 8/3, 6/3, 3/4. The reviewer who is championing the paper has a rather low confidence score and the one strongly rejecting the paper has valid concerns. However, they did not participate in a discussion, as such AC is down-weighting the score. The authors should clearly highlight the novelty in their analysis, e.g. by improving the table in the paper. In addition, authors are missing recent literature on sampling with Langevin Monte Carlo.
I think the paper is overall interesting and should be included in ICLR. The authors should carefully go over and clarify all reviewers' questions/concerns.
为何不给更高分
I think the paper is borderline.
为何不给更低分
n/a
Accept (poster)