6.0

/10

Rejected4 位审稿人

最低6最高6标准差0.0

3.8

置信度

正确性2.0

贡献度2.8

表达2.0

ICLR 2025

Convergence Analysis of the Wasserstein Proximal Algorithm beyond Convexity

Shuailong Zhu,Xiaohui Chen

OpenReview PDF

提交: 2024-09-27更新: 2025-02-05

摘要

关键词

optimization in Wasserstein spaceproximal algorithmgradient flowsamplingmean-field

评审与讨论

审稿意见

评分: 6置信度: 32024-10-19

The authors prove error rates of the proximal point algorithm in the Wasserstein space based on a Polyak-Lojasievwicz property. Moreover, they provide a convergence result under a certain small (exponentially decaying) error within the computation of the Wasserstein proximal mapping. The authors show some proof-of-concept numerics to show the validity of their claims.

优点

The main result of the paper (Thm 3.2) seems to be correct and interesting to me. As far as I know, it extends the knwon convergence rates in terms of the function values. Also the part regarding the inexact update steps is highly relevant, since the solution of the Wasserstein proximal mapping is often just computed numerically.

缺点

I would like to emphasize that for most problems it is the hardest part to prove the PL property for the considered functional and not to derive convergence after the PL property is proven. The paper could benefit significantly from stating some lighter assumptions or examples where the PL property is fulfilled. The relation LSE vs PL goes in the right direction, but should be outlined more in detail.

Also, the authors only consider convergence rates in terms of the function value. They do not even mention existing convergence results in terms of the iterates (for strongly convex functionals, convergence rates in the Wasserstein metrics appear as a special case from [SKL2020], for (not strongly) convex functionals, weak convergence of the iterates was shown in [NS2022]).

In addition, even though I have not seen the main result in the literature, some concepts of the paper already appeared in the literature and the literature part on them should be clarified. Moreover, there exists plenty of technical errors and inexact parts in the paper. I include a list below (not ordered by importance).

Summarizing, the paper feels a bit unfinished. While the main result is interesting, the step towards convergence of the iterates is still missing. Moreover I would suggest to change the "beyond convexity" in the title to "under Polyak-Lojasievwicz property" to directly clarify the assumptions (or somehow else represent this assumption in the title). Finally, the paper currently contains too many small errors and typos for being published. I am willing to raise my score, when the authors correct the errors during the rebuttal phase.

List of confusions:

Did I miss something or is Lemma A.1 just the Gronwall inequality (see https://en.wikipedia.org/wiki/Gr%C3%B6nwall%27s_inequality for references)? No need to reprove it.
The assumptions required for the proof of Lemma A.2 are not sufficient. In order to use Prokorov's theorem for constructing a solution of the proximal map, you need that F is lsc wrt weak convergence in $P_2(\mathbb{R}^d)$ (at least on $W_2$ -bounded sets) which is a stronger assumption than lsc wrt Wasserstein-2. Under this stronger assumption the statement is already explained in [AGS2005] after eqt (10.3.1b).
Regarding Definition 3.1 In the space of probability measures the PL property was already defined and used in [BV2023], see also "entropy-entropy production inequalities" from [KMV2016]. In particular, it is already outlined in [BV2023] that one obtains convergence of Wasserstein gradient flows via the Gronwall inequality. I am clear that this is not the same statement, since you are considering a (backward) discretizations of the gradient flow. However, since the main idea of your proof consists out of applying the Gronwall inequality, this coincidence should be mentioned.
The proof of Lemma A.3 is not fully correct under these assumptions. In order to apply Lemma 10.4.1 from [AGS2005], the additional assumption $\rho\in C^1(\mathbb{R}^d)$ has to be fulfilled. In order to ensure this, you have to assume that your initialization is $C^1$ and that the iterates of the Wasserstein proximal point algorithm remain $C^1$ over the algorithm. A proof for this statement is missing.
The notation in Section 3.2 and in the proof of Thm 3.8 gets completely messed up. The term $\eta_{n+1}$ (appearing first in line 383) remains undefined. Then the assumption of Thm 3.8 states that $\eta<1/L$ while the proof sets in line 785 sets $\eta=1/L$ .

Small comments, formating errors and typos:

To avoid confusion it would be worth mentioning that the Hopf-Lax formula is also known as Moreau-Yoshida approximation (cf. [AGS2005]).
Lemma A.1: u maps from where to where?
Lemma A.2: The domain of F is missing in the statement
Lemma A.3: Assumptions are missing (for instance existance of optimal transport maps, F is lsc etc.).
please define semiconvexity and clarify in Thm 3.8 that it is geodesically semiconvex.
line 334: The last equality should be a greater or equal...
line 783: $\beta$ is missing the index
The green font color for citations is hard to read.

[SKL2020] Salim, Korba, Luise, "The Wasserstein Proximal Gradient Algorithm", NeurIPS 2020

[ES2022] Naldi, Savare, "Weak topology and Opial property in Wasserstein spaces, with applications to gradient flows and proximal point algorithms of geodesically convex functionals", Rendiconti Lincei, 2022

[KMV2016] Kondratyev, Monsaingeon, Vorotnikov, "A new optimal transport distance on the space of finite Radon measures", Advances in Differential Equations 2016

[AGS2005] Ambrosio, Gigli, Savare, "Gradient Flows in Metric Spaces and in the Space of Probability Measures", 2005

[BV2023] Boufadene, Vialard, "On the global convergence of Wasserstein gradient flow of the Coulomb discrepancy", arxiv preprint, 2023.

问题

I stated all my questions and suggestions in the weaknesses part.

2024-11-27

Thanks for your careful comments! Below is our reply to your questions and we revised the paper accordingly.

(On Gronwall inequality) The vanilla Gronwall inequality requires the function to be everywhere differentiable in the interval interior. For our problem, we only have almost everywhere differentiability for the Hopf-Lax semigroup, and the proof of Lemma A.1 (Lemma B.6 in revised version) requires additional monotonicity for the inequality (highlighted in red) to hold. So we added a Remark B.7 to clarify that our inequality extends the classical Gronwall lemma, which requires everywhere differentiability, to the case with almost everywhere differentiability and monotonicity.
(On existence of minimizer) In the revision, we clarified that $F$ needs to be lower semicontinuous (lsc) in the weak topology and put the assumption of weakly lsc in our Assumption 1 as our regularity assumption. We also added a Lemma B.3 to summarize some classical conditions that guarantees weak lower semicontinuity.
(On PL definition) We added the references [BV2023, KMV2016] to the PL definition.
(On subdifferential representation) Thanks for pointing out this technical issue in Lemma A.3! First, we clarify that if $\Theta$ is a compact subset of $\mathbb{R}^d$ , then eqn.(18) in Lemma A.3 remains true $\rho_\xi$ -a.s. for probability measures supported on compact domains (cf. added Lemma B.5). So the difficult arises primarily when $\Theta = \mathbb{R}^d$ (or unbounded domain). Fortunately, our proof of Theorem 3.2 only requires either one of the followings: (a) Lemma A.3, i.e., the representation of strong subdifferential of the form $(T_{\rho_{\xi}}^{\rho}-I) / \xi$ in terms of Wasserstein gradient $\nabla \dfrac{\delta F}{\delta \rho}(\rho_\xi)$ holds along the proximal trajectory, OR (b) $\nabla \dfrac{\delta F}{\delta \rho}(\rho_\xi)$ is the minimal $L_2(\rho_\xi)$ -norm element in the strong subdifferential. Existence (but not uniqueness) of strong subdifferential is guaranteed by [Lemma 10.1.2, (Ambrosio et al., 2005)] when $\rho_\xi\in D(|\partial F|)$ . Thus, we impose a slightly stronger assumption in the revision (Assumption 2 (a) or (b) on proximal trajectory). We remark that this new assumption is not restrictive for our main application MFLD. In particular, we show that the MFLD on the standard two-layer architecture with some regularity on the perceptron for either squared (regression) or logistic (classification) loss is a sufficient condition for Assumption 2(b) in Corollary 3.5 (revised version).
For geodesically strongly convex case (Theorem 3.8 in revised version), we modified our proof to avoid invoking Assumption 2 completely, by heavily utilizing the fact that $(T_{\rho_{\xi}}^{\rho}-I)/\xi$ is a strong subdifferential (Lemma 10.1.2, (Ambrosio et al., 2005)), and provided an iteration rate for the unique minimizer $\rho^*$ under $\mathcal{W}_2$ (in addition to the functional value convergence rate). Similarly as [SKL2020], the $\mathcal{W}_2$ convergence rate on the minimizer is also linear.
For proof of Theorem 3.8, it suffices to prove the case $\xi=1/L$ because for $\xi<1/L$ , a functional that is $(-L)$ geodesically convex is also $-(1/\xi)$ geodesically convex. We apologize for this confusion and revised our proof accordingly.
We apologize notation typos and corrected them as suggested.
For small comments, we have resolved all of them.

(Ambrosio et al., 2005) Ambrosio, Gigli, and Savar´e. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.

(Chizat, 2022) Lenaic Chizat. Mean-field langevin dynamics: Exponential convergence and annealing, 2022.

2024-11-27

So far I was just roughly reading over the revised version (I will have a closer look in the next days). The authors corrected a large number of errors and confusions. However, I already found again a few confusions:

Regarding the Gronwall inequality: You are basically employing the integral form of the Gronwall lemma. It is contained in most standard text books, Wikipedia, or in the Encyclopedia of Mathematics (Thm 3 of https://encyclopediaofmath.org/wiki/Gronwall_lemma). It just requires $u$ to be measurable and non-negative (no differentiability assumption). I disagree with the claim that Lemma B.6 extends the existing forms of the Gronwall inequality in any direction. If you add continuity to the assumptions (you apply it to the Moreau-Yoshida approximation which is continuous), then the statement is e.g. given in Appendix B, inequality j of the text book "Partial Differential Equations" of L. C. Evans.
I am a bit confused by the newly introduced Assumption 2. It says that you assume that either Assumption 2 (a) or Assumption 2 (b) is fulfilled. But Assumption 2 (a) obviously implies Assumption 2 (b). So wouldn't it be sufficient just to state Assumption 2 (b)?

2024-11-28

We want to express our sincere appreciation for your careful review!

(On Assumption 2)

Thank you so much for pointing out this confusion! We only kept Assumption 2b in our new paper version as you suggested.

(On Gronwall inequality)

Thank you for the references to the integral form of Gronwall inequality! However, from our understanding, it is still different from our lemma to the best of our knowledge. We compare our lemma with the integral form of Gronwall inequality in the following,

(1) Following the similar proof strategy of our Lemma B.6, we can show

If $u(t)$ is a increasing positive function, and $\frac{d}{dt}u(t)\geq u(t)c(t)$ almost everywhere in $(0,\xi)$ , where $c(t)>0$ for $(0,\xi)$ , we have,

u(\xi)\geq u(0)\int_0^{\xi} exp(c(t))dt

we can see that the inequality direction differs from Gronwall inequality under this setting. We believe this already shows some differences.

(2) If $u(t)$ is a decreasing positive function, and $\frac{d}{dt}u(t)\leq u(t)c(t)$ almost everywhere in $(0,\xi)$ , where $c(t)<0$ for $(0,\xi)$ , we seemly have some "Gronwall-type" inequality in this case,

u(\xi)\leq u(0)\int_0^{\xi} exp(c(t))dt

And in this case, by monotonicity, we actually have

u(\xi)\leq u(0)+\int_0^{\xi}u(t)c(t)dt

which is very similar to the integral form of Gronwall lemma.

However, without monotonicity, we will not have the equation above. On the other hand, in the equation above, $c(t)<0$ which is different from (J) equation in Appendix B of Evans' book and Theorem 2 in Encyclopedia, where $c(t)>0$ is assumed.

(3) In conclusion, (J) equation in Appendix B of Evans' book and Theorem 2 in Encyclopedia can control that $u(\xi)$ at most increases "exponentially". However, our Lemma B.6 can guarantee that $u(\xi)$ at least shrinks "exponentially" (or at least increases "exponentially" by (1)).

2024-11-29

Thank you so much for your feedback and appreciate your careful reviewing again!

(On Remark 3.3) We highly agree that current illustration in Remark 3.3 is not straightforward and thank you for pointing this out. We will mainly modify Remark 3.3 from the following two aspects as you suggested.
- We directly state Lemma B.5: If $\Theta$ is compact, then we have
  $\nabla\frac{\delta F}{\delta \rho}(\rho_\xi)= ({T_{\rho_\xi}^{\rho}-\mathbf{id}})/{\xi}$
  Therefore, Assumption 2 holds.
- For $\Theta=\mathbb{R}^d$ , we explain why ” $(T_{\rho_\xi}^{\rho}-\mathbf{id}) / \xi$ is a strong subdifferential of $F$ at $\rho_\xi$ “ is important: If $\Theta=\mathbb{R}^d$ , we can not directly have $\nabla\frac{\delta F}{\delta \rho}(\rho_\xi)= ({T_{\rho_\xi}^{\rho}-\mathbf{id}})/{\xi}$ as in compact case (we will add some discussion in Lemma B.5). However, by [Lemma 10.1.2, (Ambrosio et al., 2005)], $\rho_{\xi}\in D(|\partial F|)$ and $(T_{\rho_\xi}^{\rho}-\mathbf{id}) / \xi$ is a (not necessarily unique) strong subdifferential. If we can prove $\nabla\dfrac{\delta F}{\delta \rho}(\rho_\xi)$ is also a strong subdifferential with minimum $L_2(\rho_\xi)$ -norm under some conditions, then Assumption 2 holds.
(On PL inequality) If we understand your question correctly, you want to have an example that doesn't contain that many conditions as in Corollary 3.5. The example in Corollary 3.5 is indeed involved. We provide an abstract version of it here, also see (Chizat, 2022):

Assume the functional can be written as $F_{\tau}=R(\rho)+\tau \int \rho log\rho$ . If $F_\tau$ satisfies the following two properties, then the PL inequality holds,

(1) $R$ is convex in the linear sense. (Not geodesic convexity)

(2) Uniform LSI: see Definition C.1 in our paper, also see [Assumption 3, (Chizat, 2022)]. Basically, it requires every proximal Gibbs distribution to satisfy LSI with a constant.

Under these two conditions, we can prove a PL inequality, see (2) in the proof of Corollary 3.5 or [Section 3, (Chizat, 2022)].
(On Gronwall inequality) In our Lemma B.6, we proved this "Gronwall" inequality for a monotone function. As you pointed out, we believe the proof is already somewhere. However, we kept the proof in our paper for completeness.

We are sorry for the format issues: we will clean the references, and solve the format issues related to the "norm" and Theorem 3.4.

If you have further questions, please don't hesitate to let us know! Thank you again for your constructive comments!

(Ambrosio et al., 2005) Ambrosio, Gigli, and Savar´e. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.

(Chizat, 2022) Lenaic Chizat. Mean-field langevin dynamics: Exponential convergence and annealing, 2022.

2024-11-29

I was reading through parts of the revised paper. The authors have addressed several of the comments and I raised my score.

Some comments and concerns remain:

Most readers won't have any chance to understand the meaning of Remark 3.3. Please explain, why it is interesting that $(T-id)/\xi$ is a strong subdifferential. Avoid referencing future claims (Cor 3.5 and 3.6). Why does the existence of the first variation implies its specific form (it would be better to directly state Lemma B.5 instead of writing a sentence which cannot be understood at this point).
In order to distinguish PL over LSI: Can you give some specific examples (despite the KL divergence) where PL holds?
Regarding the Gronwall inequality: I see that my reference was not correct due to the varying sign of the C. But I am sure that a form of the Gronwall inequality for continuous, almost-everywhere differentiable $u$ already exists (for instance the Wikipedia proof never uses the strict everywhere differentiabiltiy and has no sign restriction on C) and would do the job in your case (Your modification (1) can also be done for the standard proof of Gronwall). However, since I believe that Lemma B.6 is correct (just not new) and it is not considered as a main contribution of this paper (neither by authors nor by reviewers), I will just ignore this issue for this review.
Even though it does not influence my score, the authors definitely should invest significantly more efforts in proofreading and formatting their paper. E.g.:
- As other reviewers pointed to that, the reference list still does not look appropriate (references which have neither journal nor arxiv identifiers, wrong capitalization etc.)
- stuff like using \left and \right for the norm lines throughout the paper (e.g. in Assumption 2)
- The proof of Thm 3.4 does not fit the margins

审稿意见

评分: 6置信度: 42024-10-24

This paper provides convergence analysis of (inexact) Wasserstein proximal algorithm (WPA) for minimizing functionals satisfying PL inequality, where linear and unbiased convergence rates are achieved. This paper also sharpens the previous convergence analysis under strong geodesic convexity. Experiments on sampling from 1-D standard Gaussian and training mean-field neural network validate the faster convergence of Wasserstein proximal algorithm.

优点

This paper is the first to analyze Wasserstein proximal algorithm under PL-inequality and obtains linear and unbiased convergence rates, which is a generalization of proximal algorithm in Euclidean space without smoothness and convexity assumptions. For KL objective, PL-inequality is equivalent to log Sobolev inequality (LSI) which is interesting to MCMC community.
This paper also sharpens the analysis of [Yao & Yang, 2023] and [Cheng et al., 2024] under strong geodesic convexity.

缺点

This paper doesn't show the implementation details of WPA. The proximal operator is intractable to calculate in the Wasserstein space.
The convergence rate under PL for KL objective resembles the convergence rate of the proximal sampler under LSI (Thm 3 of [Chen et al., 2024]}. However, the proximal sampler seems to be more implementable.
The experiments are weak: the experiments only consider low dimension examples, but the large-scale problem setting is more interesting to machine learning and sampling community and the provided algorithm should be robust to large-scale problems. The implementation of WPA in training mean-field neural networks is ambiguous, especially for (17).
This paper is short of some references discusing training mean-field neural network using mean-field Langevin algorithm such as [Fu & Wilson., 2023] and [Kook et al., 2024].

[Chen et al., 2024] Chen, Yongxin, et al. "Improved analysis for a proximal algorithm for sampling." Conference on Learning Theory. PMLR, 2022.

[Kook et al., 2024] Kook, Yunbum, et al. "Sampling from the mean-field stationary distribution." The Thirty Seventh Annual Conference on Learning Theory. PMLR, 2024.

[Fu & Wilson., 2023] Fu, Qiang, and Wilson, Ashia. "Mean-field underdamped Langevin dynamics and its space-time discretization." arXiv preprint arXiv:2312.16360 (2023).

问题

The WPA seems to be a natural generalization of proximal algorithm in Euclidean space, but the implementation is still unclear to me. How can you implement WPA generally?
For the application of WPA in sampling, as I mentioned in Weakness part, the rate of WPA for KL objective is the same as the rate of proximal sampler, but the proximal sampler is easier to implement. How is WPA competitive with proximal sampler?
For the application of WPA in training mean-field neural networks, how can you implement update (17)? Can this be exactly solved or approximation? There should be more clarification on that. Also, this paper claims the unbiasedness with particle approximation as the only error resource. Can you show the approximation error in terms of the number of particles $N$ ? This could be interesting in this application. As is shown in [Kook et al., 2024], you can apply any unbiased log-concave samplers to sample from the empirical distribution, and the distance (KL, W2) between the empirical distribution and the solution of training mean-field neural networks is non-asymptotically bounded in terms of $N$ . How is WPA competitive with other unbiased samplers?

2024-11-27

(On the implementation of WPA) To compute $T^{n+1}$ that maps $\rho_n$ to $\rho_{n+1}$ , each step we will train a small neural network to minimize the loss of the right hand side of (17), and thus can approximate $T^{n+1}$ . In our case where an entropy regularization is present, we use a neural network to compute the map $T^{n+1}$ . Since the particle number is finite, $H(\rho)=\int \rho \log\rho$ is not well-defined and cannot be estimated directly. We instead compute the map $T^{n+1}$ via a change of variable formula, see [Example 2, (Yao et al. 2024)]. For details of the change of variable formula, we refer to [Theorem 1, (Mokrov, et al., 2021)]. We also remark that the training of a transport map is closely connected to the proximal point algorithm in Euclidean space:

$(\theta_{j}^{n+1})=argmin_{T}\ \dfrac{1}{N}\sum_{i=1}^{N}l(\dfrac{1}{m}\sum_{j=1}^m \varphi(\theta_j^{n+1},x_i),y_i)+\dfrac{\lambda}{m}\sum_{j=1}^{m}\Vert \theta_j^{n+1}\Vert^2 +\dfrac{1}{2m\eta}\sum_{j=1}^{m}\Vert \theta_j^{n+1}-\theta_j^n\Vert^2$

which is applicable when there is no entropy regularization term. In the noise-free case, we might not need to introduce a neural network to compute $T^{n+1}$ . Instead, we can directly compute the particle position at next iteration. However introducing a neural network allows us to learn a map for each step, which provides certain flexibility. For instance, if the number of particles changes, we need to compute (*) for each step from scratch if we don't compute a map for each step. By contrast, if we already trained a neural network, then we are able to reuse the trained neural network for each step.
(On proximal sampler) For KL divergence, (Chen et al., 2022) proposed a proximal sampler. At every step, it requires to implement an independent approximate RGO for each particle. If we want to increase the number of sampled particles, we need to repeat the implementation of approximate RGO for newly added particles. Even though the smoothness condition (or semiconvexity of $f$ ) is not listed in [Theorem 3, (Chen et al., 2022)], the chain rule used in the main lemma (Lemma 12) implicitly requires geodesic semiconvexity (for KL divergence it requires $f$ to be semiconvex), see (Section 10.4.E, (Ambrosio et al., 2005)). And also, only if $f$ is smooth, then approximate RGO can be implemented "effectively", for example by gradient descent. In addition, we want to stress here that all of the bound established in (Chen et al., 2022) are only based on the setting of sampling from a fixed function.

Under LSI condition and semiconvexity, WPA and proximal sampler share the same convergence rate for sampling from a fixed function.
(On Unbiased Sampler and the application to MFLD) First, we do not cover the particle discretization error, which is a challenging problem for proximal gradient methods. (Kook et al., 2024) and some other works introduce various methods to analyze the error induced from time-discretization error and particle-discretization error, for "Langevin" type algorithms. We have added the discussion in our appendix. However, to the best of our knowledge, (Kook et al., 2024) mostly focused on a general framework for analyzing the error for mean-field sampling (specifically, Langevin-type algorithm). There is no new algorithm proposed. However, our paper focus on the time-discretization error of Wasserstein proximal algorithm when geodesic convexity is missing and leave the particle discretization error to future.
(On experimental setup) For the experiments, choosing $d=2$ is because the higher the dimension is, the more particle we need to well approximate the two time-discretizations, Langevin algorithm (assume infinite particle) and proximal algorithm. For example, the experiments in (Chen et al., 2024) also uses $d=2$ to demonstrate the influence of particle number.

(Yao et al., 2024) "Wasserstein proximal coordinate gradient algorithms." JMLR, 2024.

(Chen et al., 2022) Chen, Yongxin, et al. "Improved analysis for a proximal algorithm for sampling." Conference on Learning Theory. PMLR, 2022.

(Xu et al., 2024)Xu C, Cheng X, Xie Y. Normalizing flow neural networks by JKO scheme[J]. Advances in Neural Information Processing Systems, 2024, 36.

(Kook et al., 2024) Kook, Yunbum, et al. "Sampling from the mean-field stationary distribution." The Thirty Seventh Annual Conference on Learning Theory. PMLR, 2024.

(Ambrosio et al., 2005) Luigi Ambrosio, Nicola Gigli, and Giuseppe Savar´e. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.

(Chen et al., 2024) Chen F, Lin Y, Ren Z, et al. Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics[J]. Electronic Journal of Probability, 2024, 29: 1-43.

(Mokrov et al., 2021) Petr Mokrov, et al. Advances in Neural Information Processing Systems, 2021.

2024-11-27

Thank you for the clarification on the implementation of WPA by training neural networks. For proximal sampler, we can control the error of implementing RGO by convex optimization or rejection sampling, but the error of implementing WPA is still unknow, which could be a limitation. However, this seems to be a new perspective for sampling. In this case, I will raise my score to 6.

2024-11-28

Thank you for your feedback! We believe that estimating the particle discretization error of WPA will be an interesting direction in the future.

审稿意见

评分: 6置信度: 52024-10-27

The convergence of the Wasserstein proximal algorithm without assuming geodesic convexity on the objective functional and the improvement of convergence rates also for the geodesically convex case is a very interesting topic as well as the extension to the inexact proximal algorithm and numerical experiments underlining the theory are highly welcome. So the intention of the authors is fine and ''roughly'' the methods and proofs fit. However, the mathematical realization lacks accuracy which is necessary for such a paper. The authors should go through their paper step by step, pose correct assumptions, metrics and so on, starting with my remarks below.

优点

The task to show unbiased and linear convergence rate of the general-purpose Wasserstein proximal algorithm for optimizing a functional under merely a Polyak-Łojasiewicz (PL) inequality is interesting and the numerics with two-layer wide neural networks is promising.

缺点

Both the mathematical writing and the English is faulty (below I list only few of the shortcomings I detected, but there are many more). The mathematical notation is not used in a correct way, the authors have to give all and correct assumptions for their claims to hold true.

问题

formula (2): maybe change $\rho'$ to something else, since the first one looks like a derivative
line 067: there is a notation switch between $\varphi$ and $\phi$
the authors write that they ,,assume that the strong subdifferential exists for every $\rho \in \mathcal P_2^a$ ''; this will never be the case; you do not mean this, so please fix
- line 193 ''satisfy'' instead of ''satisfies''
- line 208: A functional $F$
- line 232: $v$ should be $\nu$
- line 277: ''any stationary point ... is a global minimum $F^*$ ''. Please reformulate, it is a global minimizer - not a minimum
- Definition 3.2: give a reference to the Hopf-Lax ''formula''; maybe write that many definitions are taken from Ambrosio 2013; note that it resembles the Moreau envelope in the Hilbert space setting; further there is a notation mismatch: $\xi$ was used as subgradient, now it is the regularization parameter, in (2) this was $\eta$ . The definition supposes that a (unique) proximum exists. I could live without ''unique'' if the authors write $\rho_\xi \in prox_{F,\xi} (\rho)$ , but then the existence is only clear with additional assumptions on $F$ . In Lemma A2 the authors clarify that $F$ has to be lower semicontinuous (but this is not enough, see below).
- Lem A.1: under the correct assumptions (you also need differentability), this is the well-known Gronwall lemma; why you don't use this name. Please reformulate also in correct English.
- Lem A.2: The lemma is wrong (also in Hilbert spaces if you assume just $F$ is lsc, take just $F(y)= - exp(y)$ ), not only the formulation ''algorithm (2) admits a minimizer'' . Suppose that the authors mean that the Wasserstein proxy exists for all $\rho$ . Please address which metric is used (weak^* convergence) The closed balls in $\mathcal P_2(\mathbb R^d), W_2)$ are known to be tight, see (M. Yue, D. Kuhn, and W. Wiesemann. On linear optimization over Wasserstein balls. Math. Program., 2021, Theorem 1). On the other hand, under correct assumptions you can find the result of the whole lemma in Ambrosio's book.
- line 676 what is $\mathcal B_C$ ?
- line 681: it must be $K$ not $K_i$
- Lem A3: Do you need some smoothness assumptions on $\rho$ ?
- Lem A4: This is exactly Prop. 3-1 , 3.3 Ambrosio 2023. Further (2) is ''if and only if''
  - line 716: there should be a ''='' between $D_+$ and $D_-$
- Proof of Lem 3.1: Since Lem A2 is not correct it cannot be used in the proof.
- Def B.1: reformulate, what is defined here Proof of Thm 3.2: line 316: why you have an additional $\xi$ in the second summand? line 333: the last equality should be an inequality The rest appears to be correct.
  - Cor 3.4: what do you mean by $\varphi(x,y)$ ''is uniformly in $y$ '' (uniformly what?)
clean up the reference list, in particular please respect capitals

2024-11-27

Thank you so much for your careful reviewing. Some similar questions were also raised by Reviewer jLpS. So we answer those questions first.

(On Gronwall inequality) The vanilla Gronwall inequality requires the function to be everywhere differentiable in the interval interior. For our problem, we only have almost everywhere differentiability for the Hopf-Lax semigroup, and the proof of Lemma A.1 (Lemma B.6 in revised version) requires additional monotonicity for the inequality (highlighted in red) to hold. So we added a Remark B.7 to clarify that our inequality extends the classical Gronwall lemma, which requires everywhere differentiability, to the case with almost everywhere differentiability and monotonicity.
(On existence of minimizer) In the revision, we clarified that $F$ needs to be lower semicontinuous (lsc) in the weak topology and put the assumption of weakly lsc in our Assumption 1 as our regularity assumption. We also added a Lemma B.3 to summarize some classical conditions that guarantees weak lower semicontinuity.
(On subdifferential representation) Thanks for pointing out this technical issue in Lemma A.3! First, we clarify that if $\Theta$ is a compact subset of $\mathbb{R}^d$ , then eqn.(18) in Lemma A.3 remains true $\rho_\xi$ -a.s. for probability measures supported on compact domains (cf. added Lemma B.5). So the difficult arises primarily when $\Theta = \mathbb{R}^d$ (or unbounded domain). Fortunately, our proof of Theorem 3.2 only requires either one of the followings: (a) Lemma A.3, i.e., the representation of strong subdifferential of the form $(T_{\rho_{\xi}}^{\rho}-I) / \xi$ in terms of Wasserstein gradient $\nabla \dfrac{\delta F}{\delta \rho}(\rho_\xi)$ holds along the proximal trajectory, OR (b) $\nabla \dfrac{\delta F}{\delta \rho}(\rho_\xi)$ is the minimal $L_2(\rho_\xi)$ -norm element in the strong subdifferential. Existence (but not uniqueness) of strong subdifferential is guaranteed by [Lemma 10.1.2, (Ambrosio et al., 2005)] at proximal $\rho_\xi$ . Thus, we impose a slightly stronger assumption in the revision (Assumption 2 (a) or (b) on proximal trajectory). We remark that this new assumption is not restrictive for our main application MFLD. In particular, we show that the MFLD on the standard two-layer architecture with some regularity on the perceptron for either squared (regression) or logistic (classification) loss is a sufficient condition for Assumption 2(b) in Corollary 3.5 (revised version).

Next we address other questions:

"the authors write that they assume that the strong subdifferential exists for every $\rho\in\mathcal{P}_2^a(\Theta)$ ''; this will never be the case; you do not mean this, so please fix."

Reply: We apologize for the confusion. We clarified that our result relies on the fact that strong subdifferential at the proximal $\rho_\xi\in\mathcal{P}_2^a(\Theta)$ .

"Lem A.3: Do you need some smoothness assumptions on $\rho$ ?"

Reply: Yes, [Lemma 10.4.2, (Aambrosio, et al. 2005)] requires smoothness assumption we ignored before. Please see our detailed reply in (On subdifferential representation).

"Lem A4: This is exactly Prop. 3-1 , 3.3 (Ambrosio 2013)."

Thanks. This part is not our contribution, and thus we put it to appendix and cited (Ambrosio 2013).

"Def B.1: reformulate, what is defined here Proof of Thm 3.2: line 316: why you have an additional $\xi$ in the second summand? line 333: the last equality should be an inequality The rest appears to be correct."

We reformulated Def B.1 (now Def C.1 in revised version). The additional $\xi$ and the last equality is a typo. We fixed this.

"Cor 3.4: what do you mean by φ(x,y) ''is uniformly in y'' (uniformly what?)"

We have corrected this part in Corollary 3.8 in our new version, also we refer to Proposition 5.1 (Chizat, 2022) for details.

We also cleaned up the reference list for capitals.

(Ambrosio et al., 2005) Ambrosio, Gigli, and Savare. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.

(Chizat, 2022) Lenaic Chizat. Mean-field langevin dynamics: Exponential convergence and annealing, 2022

2024-11-27

Finally, the authors have improved the paper. However, at first glance I can still detect too many typos, so that I leave the score at 5.

2024-11-29

Thank you for your feedback and we are sorry for the typos, especially in our original version.

We have uploaded a new version in which we corrected most of the typos, and we would appreciate it if you could check the two main problems we solved. To save your time, we highlight the main changes below,

On the existence of minimizer [Assumption 1, Lemma B.3, Remark B.4]

We put the weakly lsc condition to our Assumption 1, and we refer to Lemma B.3 and Remark B.4 for discussions on conditions that ensure weakly lower semicontinuity.
On subdifferential representation [Assumption 2, Remark 3.3, Lemma B.5, proof of Corollary 3.5]:

We proposed Assumption 2 and discussed two settings that Assumption 2 can be satisfied in Remark 3.3.

If you have additional questions, please don't hesitate to let us know. Thank you for your time and constructive comments again!

2024-12-02

Raised the score to 6. Write journals with capitals in all references

2024-12-02

Thank you for your feedback! We will solve the issue of the reference.

审稿意见

评分: 6置信度: 32024-11-05

The linear convergence of Wasserstein proximal point method is established based on a Wasserstein analogue of the Euclidean PL inequality and the convergence of the algorithm when the subproblem is solved inexactly is also studied.

优点

New convergence results based on the PL inequality.

缺点

It seems the analysis is well aligned with that in optimization. So what is the key challenges in the analysis?

问题

For $\mu$ -strongly convex objective, this paper provides a sharper result. How is it achieved compared to existing work?
Except the common ones mentioned in this paper, are there any functionals that could satisfy the PL inequality?
In optimization, PL is special case of KL (with exponent $1/2$ ). Is it also possible to study the Wasserstein proximal point method under the more general KL condition?

2024-11-27

For Reviewer ppSk

(Connection to Euclidean case) Under convexity and PL inequality in Euclidean space, a linear convergence of proximal algorithm can be proved, which is a classical result. However, there is no strict proof to show that under merely PL inequality, the linear convergence can be achieved. Our proof strategy actually can provide a strict proof for linear convergence of proximal under merely PL inequality in Euclidean space. For Euclidean case, with the strong lemma from (Ambrosio, 2013), the main challenge will only be how to obtain a "Gronwall" lemma for function that is only almost everywhere differentiable, see Lemma B.6 and Remark B.7.
(Key Challenges in Wasserstein space) In Wasserstein space, the analysis becomes more difficult. Except for the aforementioned generalization of the "Gronwall" lemma, there is an additional main challenge. In Wasserstein space, the strong subdifferential might not exist everywhere and might not be unique. However, in Euclidean case where we have $0=\theta_{\xi}-\theta+\xi\nabla f(\theta_{\xi})$ for $\theta_\xi=argmin f(\tilde\theta)+\dfrac{\Vert\theta-\tilde\theta\Vert^2}{2\xi}$ , when we assume $f$ is differentiable everywhere. In Wasserstein space we don't have that $\dfrac{T_{\rho_\xi}^{\rho}-I}{\xi}=\nabla\dfrac{\delta F}{\delta \rho}(\rho_\xi)$ directly, see Remark 3.3. Solving this problem is generally quite technical and might involve some other conditions (such as smoothness of the cost functional, Assumption 3 in (Chizat, 2022)), which are not needed for the Euclidean case.
(On Shaper bound for strongly convex objective) The main ingredient comes from the proof of Theorem 3.4 (paper in new version), where we utilized the Lemma 3.1 which connect the time derivative of Hopf-Lax formula (a.k.a. Moreau envelope) and the distance between the proximal $\rho_\xi$ and $\rho$ . Specifically, we are able to compute an integral over time $\xi$ , which allows us to have a better bound. In this way, for every $\rho_\xi,$ where $\xi\in[0,\eta]$ , we can utilize the definition of strong convexity along geodesics.

However, the classical proof for strongly convex objective, only utilizes strongly convexity along geodesics for $\xi=\eta$ once, which leads to a weaker bound.

Again, if we want to extend a faster rate for strongly convex objective in Euclidean space, it is much easier than Wasserstein space case and straightforward as strong convexity implies PL inequality in Euclidean space.
(Other cases of PL inequality) We know one example from (Chizat, 2022) on Kernel Maximum Mean Discrepancy. Under [Proposition 5.2, (chizat, 2022)], the objective functional satisfies PL type inequality.
(Extension to KL inequality) Thank you for your suggestion! We believe it is possible to extend our analysis to Wasserstein KL inequality setting, and there is already some research on continuous time dynamics under Wasserstein KL inequality (Blanchet, et al., 2018).

(Chizat, 2022) Lenaic Chizat. Mean-field langevin dynamics: Exponential convergence and annealing, 2022. URL https://arxiv.org/abs/2202.01009

(Ambrosio et al., 2013) Calculus and heat flow in metric measure spaces and applications to spaces with Ricci bounds from below. Inventiones mathematicae, 195(2):289–391, February 2013. URL http://dx.doi.org/10.1007/s00222-013-0456-1.

(Blanchet, et al., 2018) Blanchet A, Bolte J. A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions[J]. Journal of Functional Analysis, 2018, 275(7): 1650-1673.

AC 元评审

2024-12-21

The authors studied the convergence of a proximal Wasserstein gradient descent method under a PL-inequality. While this type of result is expected, as far as any of the reviewers or myself know, this seems to be novel.

However, the rest of this work is questionable in terms of novelty. It would be desirable to have a full analysis of the inexact algorithm, as implementation in practice can only approximate the proximal map. In particular, the authors made a very strange assumption in Theorem 3.10 where the approximation error decays with the number of steps. After a brief discussion, both myself and several reviewers believe this assumption is too restrictive. Therefore at this point, I would consider this analysis incomplete.

Given that the average score is very borderline, and no reviewer is will to champion this paper towards acceptance, I'm afraid I would have to recommend reject.

That being said, I believe an improved discretization analysis would indeed yield a very good paper. At the same time, I believe Theorem 3.4 is a welcomed result to the literature, and perhaps it would be a better fit for another venue such as TMLR, where the novelty of results is not being prioritized.

审稿人讨论附加意见

The discussion period between the authors and reviewers mostly addressed typos. confusions, and missing descriptions of algorithms.

That being said, the brief discussion period between AC and the reviewers were quite helpful actually. This clarified my understanding of where the paper is positioned in context of the rest of the literature, both in terms of theory and implementation.

最终决定Reject

2025-01-22

Reject