PaperHub
Average rating: 6.1/10 · Spotlight · 4 reviewers (scores 3, 4, 3, 3; min 3, max 4, std 0.4)
ICML 2025

Convergence of Mean-Field Langevin Stochastic Descent-Ascent for Distributional Minimax Optimization

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-08-16
TL;DR

We derive a $\tilde{O}(1/\epsilon)$ convergence rate for the discrete-time mean-field Langevin stochastic gradient descent-ascent algorithm for solving distributional minimax optimization.

Abstract

We study convergence properties of the discrete-time Mean-Field Langevin Stochastic Gradient Descent-Ascent (MFL-SGDA) algorithm for solving distributional minimax optimization. These problems arise in various applications, such as zero-sum games, generative adversarial networks and distributionally robust learning. Despite the significance of MFL-SGDA in these contexts, the discrete-time convergence rate remains underexplored. To address this gap, we establish a last-iterate convergence rate of $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ for MFL-SGDA. This rate is nearly optimal when compared to the complexity lower bound of its Euclidean counterpart. This rate also matches the complexity of mean-field Langevin stochastic gradient descent for distributional minimization and the outer-loop iteration complexity of an existing double-loop algorithm for distributional minimax problems. By leveraging an elementary analysis framework that avoids PDE-based techniques, we overcome previous limitations and achieve a faster convergence rate.
Keywords
mean-field Langevin dynamics, minimax optimization, Wasserstein gradient flow

Reviews and Discussion

Review (Rating: 3)

This paper studies the convergence rate of discrete-time mean-field Langevin stochastic descent-ascent for min-max problems in distributional optimization under log-Sobolev inequality condition. The authors claim that the derived convergence rate is near-optimal compared to its Euclidean counterpart. The paper includes two examples: zero-sum games and mean-field neural networks. However, no experimental results are provided.

Questions for the Authors

See my comments in weaknesses.

Claims and Evidence

The theoretical results are consistent with the authors' claims.

Methods and Evaluation Criteria

Not applicable.

Theoretical Claims

I did not verify the details of the proofs.

Experimental Design and Analysis

Not applicable.

Supplementary Material

No.

Relation to Prior Literature

The topic is important to optimization theory and has significant relevance to the machine learning community.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Strengths:

  1. The topic is interesting.
  2. The paper is easy to read.

Weaknesses:

  1. The analysis and results appear elementary and straightforward, closely following previous works such as Kim et al. (2024), Yang et al. (2020), Chen et al. (2022), Suzuki et al. (2023), and Nitanda et al. (2022). Specifically, the variable lambda in the Lyapunov function seems unnecessary, as it can be set to 1 in this case. Given this, the results appear quite direct based on current results in MFLD, limiting the paper’s technical contribution.

  2. What are the advantages of the MFL-SDA algorithm compared to MFL-AG and MFL-ABR? The lack of comparison makes the motivation unconvincing. A discussion of their relative strengths and weaknesses would improve clarity.

  3. The two provided examples are restricted to two-layer neural networks, and no experimental evidence is given. I would expect more interesting applications and empirical results to strengthen the paper’s impact.

Other Comments or Suggestions

In Eq (9):

$L_1(u) = \max_{v'} E(u,v') - \min_{u'}\max_{v'} E(u',v'), \qquad L_2(u,v) = \max_{v'} E(u,v') - E(u,v)$

Author Response

Thank you for your thoughtful comments! Below, we respond to the concerns and clarify the novelty and contributions of our work.

On our analysis: We would like to clarify that both our proof technique and the resulting conclusions differ from the aforementioned papers in several fundamental ways. First, these prior works (Kim et al. (2024), Chen et al. (2022), Suzuki et al. (2023), Nitanda et al. (2022)) tackle the discrete-time MFLD problem by first establishing convergence in continuous time (typically via Wasserstein gradient flow) and then controlling the discretization error as a separate step. In contrast, our proof directly analyzes the per-step improvement using Taylor expansion and remainder estimates. This not only provides a more elegant and adaptable proof framework, but also yields stronger results as detailed in the next point.

Moreover, our results also differ in several key aspects. Kim et al. (2024) do not study MFL-SDA. To the best of our knowledge, their analysis of other algorithms does not extend to our established convergence rate of MFL-SDA. While Yang et al. (2020) study minimax optimization in Euclidean space, and our analysis shares some structural similarities with theirs, the time-discretization error analysis of the probability functional on the distributional space requires substantially more sophistication -- we devote pages of proofs to controlling this error, whereas in the Euclidean case it follows directly from smoothness assumptions. The other cited references -- Chen et al. (2022), Suzuki et al. (2023), and Nitanda et al. (2022) -- focus on minimization problems, whereas minimax problems present more complexity due to the interaction between the two players.

On the Lyapunov function: As you said, $\lambda$ can be fixed to 1. Yet, our chosen form of the Lyapunov function follows what is commonly used in previous works on minimax problems in Euclidean space (Yang et al. (2020)) and in continuous-time MFLD (Lu (2023)).

Comparison with MFL-AG and MFL-ABR: First, our analysis of stochastic gradient descent-ascent with last-iterate convergence and the inexact gradient analysis closely aligns with practical implementations.
Second, MFL-AG achieves a sample complexity $O(\epsilon^{-O(1/\alpha)})$ to reach an $O(\epsilon)$-approximation of the saddle point, where the bound on the LSI constant $\alpha$ can be small -- potentially even of order $1/(2+d/2)$. In contrast, our convergence analysis for MFL-DA establishes a sample complexity $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$. MFL-ABR is a double-loop algorithm; its outer loop is shown to have an $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ sample complexity, but it needs an inner loop and the total complexity is not specified.

On Experimental Results and Applications: The primary aim of our work is to establish convergence guarantees for commonly used stochastic gradient descent-ascent algorithms. We acknowledge the value of empirical validation and have included a numerical experiment based on Example 2 in our paper. We apply our algorithm to the nonparametric instrumental variable (NPIV) regression problem $$ \min_f \max_g E\big[g(Z)(Y-f(X))-\tfrac{1}{2}g(Z)^2+\lambda R(f)\big], $$

where $R$ is a regularizer. Using the problem setup and datasets in [1,2], we compare our algorithm with a classic series-approximation (SA) method [3] based on out-of-sample MSE (lower values indicate better performance) and average R² (higher values indicate better performance).

*Engel Curve*

| Method | Average MSE | Average R² |
| ------- | ----------- | ---------- |
| MFL-SDA | **0.00698** | **0.256** |
| SA | 0.00730 | 0.218 |

*Returns to schooling*

| Method | Average MSE | Average R² |
| ------- | ----------- | ---------- |
| MFL-SDA | **0.1494** | **0.0799** |
| SA | 0.1626 | 0.1543 |

**4. Clarification of Equation (9):** Good catch! We will revise Equation (9) in the updated version.

Thank you again for your feedback, and we hope our response clarifies the contribution of our paper.

References:

[1] Hausman, J. A., Newey, W. K., & Powell, J. L. (1995). Nonlinear errors in variables estimation of some Engel curves. Journal of Econometrics, 65(1), 205-233.

[2] Card, D. (2001). Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica, 69(5), 1127-1160.

[3] Newey, W. K., & Powell, J. L. (2003). Instrumental variable estimation of nonparametric models. Econometrica, 71(5), 1565-1578.
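For reference, the two comparison metrics can be computed with a generic sketch like the following (our illustration only; the dataset pipelines of [1,2] are not reproduced here):

```python
import numpy as np

def out_of_sample_mse(y_true, y_pred):
    """Mean squared error on held-out data (lower is better)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 (higher is better)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```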
Review (Rating: 4)

This paper analyzes a natural algorithm for distributional min-max optimization, which consists in taking alternating Langevin steps. The main contribution of the paper is a theoretical analysis of this algorithm for the case where the gradients are exact as well as the case where the gradients are inexact. Their analysis avoids the more standard approach, which first analyzes the continuous-time flow and then bounds the discretization error. The main application is to mean-field networks.

Questions for the Authors

Please address the issue about the size of the bias term $R_1$.

Claims and Evidence

I am seriously concerned about their claim, after Theorem 1, that the bias term $R_1$ is $O(\eta_1)$. In fact, when I examine the definitions of $\Gamma_1, \Gamma_2, \Gamma_3$, it looks to me that they are in fact $O(1/\eta_1)$ (this is coming from the final terms in the definitions of $r_{g2}, r_{h2}$). Unless there is a typo, this would seem to indicate that the bias term $R_1$ is in fact $O(1)$, meaning that their main result only shows that the objective doesn't increase more than $O(1)$ throughout the trajectory. I assume that this problem extends to Theorem 2, but the relevant constants don't seem to be defined in Section A.3 (where are they?).

Another -- although less critical -- issue is Assumption 3 on the log-Sobolev constants. Although they consider several applications of their results, they don't demonstrate any settings where their assumptions hold. Of course, verifying their assumptions for neural networks is likely far too difficult, but as a sanity check, I would like to see an example of a setting where Assumption 3 actually holds. I think this would make the paper stronger.

Finally, I am probably confused, but in Corollary 1, I don't understand why there must be unique $\mu^*, \nu^*$ that the algorithm even converges to? What am I missing here?

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I checked some of the proofs in the appendix, but not all of them and not in full detail. I didn't find any serious issues, and at a high level the results seem plausible. The main concern is the dependence of the bias term on the step sizes $\eta_1, \eta_2$; see above.

Experimental Design and Analysis

There were no experiments but I don't think that's necessary.

Supplementary Material

See above.

Relation to Prior Literature

The paper is trying to push forward on discrete time analysis of distributional min-max problems. This is certainly a worthy problem, and it seems their contribution would be novel. However, I am seriously concerned about the bias terms.

Essential References Not Discussed

None that I am aware of.

Other Strengths and Weaknesses

If the bias term were truly $O(\eta_1)$ the paper would be a strong theoretical contribution. In fact, it would probably even imply a new analysis of discrete-time Langevin dynamics, since that seems to be a special case of their setup. But this is exactly why I am a bit skeptical of their results -- it's hard to imagine that they found a completely new analysis of Langevin dynamics that simultaneously holds in their more general setup. If the bias term is indeed only $O(1)$ then I don't think the paper is strong enough, as it merely asserts the objective doesn't blow up.

Other Comments or Suggestions

  • I don't follow equation 24.
  • Equation 46 uses $\tilde \mu$ without defining it (reading onwards I suppose that it is meant to be a distribution coming from the mean-value theorem).
Author Response

Thank you very much for your critical feedback, especially your sharp observations about the order of the bias term. Below, we address your concerns.

Addressing Weaknesses

Bias term of the second moment: Indeed, the norm control in the submitted version (adopted from [3, Lemma 1]) is sub-optimal. Below, please find a stronger control, adapted from [2, Lemma 1]. The corrected expression for $r_{g2}$ should be
$$r_{g2}= 2M^2_1+2\tau^2\mathbb{E}_{\mu_0}[\|\theta_0\|^2_2]+4M^2_1+ \tau(4\eta_2M^2_1+4\tau d),$$
and similarly for $r_{h2}$. Both are of order $O(1)$, so the bias term $R_1$ in Theorem 1 is $O(\eta_1)$.

$$\begin{aligned}
&\mathbb{E}_{\nu_{k+1}}[\|\omega_{k+1}\|^2]\\
=\ &\mathbb{E}_{\nu_k\otimes \rho}[\|\omega_k+\eta_2 h_k+\sqrt{2\eta_2\tau}\,\xi^2_k\|^2]\\
=\ &\mathbb{E}_{\nu_k}[\|\omega_k\|^2]+2\,\mathbb{E}_{\nu_k\otimes \rho}\big[\langle \omega_k,\ \eta_2 (h_k+\tau\omega_k-\tau\omega_k)+\sqrt{2\eta_2 \tau}\,\xi^2_k\rangle\big]+\mathbb{E}_{\nu_k\otimes \rho}[\|\eta_2 h_k+\sqrt{2\eta_2 \tau}\, \xi^2_k\|^2]\\
\leq\ &\mathbb{E}_{\nu_k}[\|\omega_k\|^2] + 2\eta_2 M_1 \mathbb{E}_{\nu_k}[\|\omega_k\|]-2\eta_2\tau\, \mathbb{E}_{\nu_k}[\|\omega_k\|^2]+2\eta^2_2\big( M^2_1 + \tau^2 \mathbb{E}_{\nu_k}[\|\omega_k\|^2]\big) +2\eta_2\tau d\\
\leq\ &\Big(1-2\eta_2\tau +\frac{\eta_2\tau}{2}+2\eta^2_2\tau^2\Big) \mathbb{E}_{\nu_k}[\|\omega_k\|^2] + 2\eta_2 M^2_1/\tau +2\eta^2_2 M^2_1 + 2\eta_2\tau d\\
\leq\ &(1-\eta_2\tau)\,\mathbb{E}_{\nu_k}[\|\omega_k\|^2] +\eta_2(2M^2_1/\tau+2\eta_2M^2_1+2\tau d),
\end{aligned}$$

where the first equality uses the fact that $\xi^2_k$ is an independent zero-mean normal; the first inequality follows from $\|h_k+\tau\omega_k\| = \|\nabla \frac{\delta J}{\delta \nu}[\mu_{k+1}, \nu_k](\omega_k)\| \leq M_1$; the second inequality holds because $2M_1\mathbb{E}_{\nu_k}[\|\omega_k\|]\leq \frac{\tau}{2} (\mathbb{E}_{\nu_k}[\|\omega_k\|])^2 + 2M^2_1/\tau \leq \frac{\tau}{2}\mathbb{E}_{\nu_k}[\|\omega_k\|^2]+2M^2_1/\tau$; and the last inequality holds for $\tau<1/(4\eta_2)$. Hence, we obtain

$$\begin{aligned} \mathbb{E}_{\nu_k}[\|\omega_k\|^2] & \leq (1-\eta_2\tau)^k\, \mathbb{E}_{\nu_0}[\|\omega_0\|^2]+ 2\eta_2 (M_1^2/\tau + \eta_2 M_1^2 + \tau d )\sum_{j=0}^{k-1}(1 - \eta_2\tau)^j\\ &\leq \mathbb{E}_{\nu_0}[\|\omega_0\|^2]+ \frac{2(M^2_1/\tau + \eta_2 M^2_1 + \tau d)}{\tau}.
\end{aligned}$$

The remaining part of the proof stays the same. Hence, we can directly check that the remaining constants $r_{gi}$, $r_{hi}$ as well as $\Gamma_0$, $\Gamma_1$, $\Gamma_2$ are all of order $O(1)$. Since $\eta_1$ and $\eta_2$ are of the same order, we conclude that the bias term
$$R_1= \frac{\lambda\big(\Gamma_2\eta_2^2+(1-2\eta_2\tau\alpha)(\Gamma_1+\Gamma_0)\eta_1^2\big)+\Gamma_1\eta_1^2}{\eta_1\tau\alpha}$$
is of order $O(\eta_1)$.
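As an elementary numerical sanity check of the geometric-series step (our illustration; the constants below are arbitrary placeholders, not the paper's), a recursion of the form $a_{k+1}=(1-\eta\tau)a_k+\eta c$ indeed stays below $a_0+c/\tau$ and converges toward $c/\tau$:

```python
# Sanity check of the bound a_k <= a_0 + c/tau implied by
# a_{k+1} = (1 - eta*tau) * a_k + eta * c, for 0 < eta*tau < 1.
# Placeholder constants, not taken from the paper.
eta, tau, c = 0.01, 0.5, 3.0
a = a0 = 10.0
for _ in range(10_000):
    a = (1 - eta * tau) * a + eta * c
    assert a <= a0 + c / tau + 1e-12  # the uniform-in-k bound
print(round(a, 4), c / tau)  # prints 6.0 6.0 -- the iterate converges to c/tau
```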

Regarding your question on the definition of constants in (18), they are defined in the proof of Theorem 2 (Line 1214, etc.). We apologize for not explicitly displaying them.

Assumptions of LSI in examples: Indeed, we have provided two applications in Section 4, with the LSI verified in Proposition 3 and Proposition 4, respectively. This follows a line of work such as [1, Proposition 5.1] and [2, Appendix A]. The setups therein are satisfied in our context.

Existence and uniqueness of the optimal $\mu^*,\nu^*$: The primal objective $J(\mu, \nu)$ is convex in $\mu$ and concave in $\nu$. With an additional KL (or entropy) regularization, the regularized objective $E(\mu, \nu)$ becomes strongly convex in $\mu$ and strongly concave in $\nu$. This ensures the existence and uniqueness of the mixed Nash equilibrium $(\mu^*, \nu^*)$; please refer to the detailed discussion before equation (3) in the manuscript as well as [2, Proposition 2.1].

Addressing Questions

A1. To get (24): the first line follows directly from equation (23); the second line follows from the definition of $T$; the third line follows from the Taylor expansion of $\log\det$, where $\bar{\omega}_k$ is the point that achieves the mean value; and the fourth line follows from the property of the trace operator together with ${\rm Tr}((\nabla h_k)^2) =\|\nabla h_k\|_F^2 = \big(\|\nabla^2 \tfrac{\delta J}{\delta \nu}\|_F - \tau\big)^2 \leq (M_2 + \tau)^2$.

A2. Yes, you're absolutely right.

Thank you so much again for your insightful and constructive feedback. We hope our responses alleviate your concerns.

References:

[1] Chizat, L. Mean-field Langevin dynamics: Exponential convergence and annealing. Transactions on Machine Learning Research, 2022.

[2] Suzuki, T., Wu, D., and Nitanda, A. Mean-field Langevin dynamics: Time-space discretization, stochastic gradient, and variance reduction. Advances in Neural Information Processing Systems, 36, 2024.

[3] Nitanda, A., Wu, D., and Suzuki, T. Convex analysis of the mean-field Langevin dynamics. International Conference on Artificial Intelligence and Statistics, 2022.

Reviewer Comment

Thank you for your response, it does seem like this addresses the order of the bias issue. I also apologize for missing Propositions 3 and 4. I have updated my score accordingly.

Author Comment

Thank you very much for your thoughtful comments and for taking the time to revisit your evaluation. We're glad to hear that the updated version addresses the issue regarding the order of the bias.

Review (Rating: 3)

This paper studies the mean-field Langevin (stochastic) descent-ascent (MFL-DA) algorithm for solving distributional minimax optimization problems. The authors demonstrate that the infinite-particle limit of discrete-time MFL-DA converges to the unique stationary point of the problem with a convergence rate of $\frac{1}{\epsilon}\log\frac{1}{\epsilon}$ measured in squared 2-Wasserstein distance. The authors show applications of their result to finding mixed Nash equilibria of zero-sum games, generative adversarial networks, and mean-field neural networks.

Questions for the Authors

  1. It seems that $\frac{1}{\alpha_1}$ should be replaced with $\alpha_1$ in Corollary 1; otherwise we can drive the Wasserstein distances to zero by simply letting $\alpha_1 \to 0$.

  2. What is the optimal value of $\lambda$ for Theorems 1, 2, and Corollary 1?

  3. Does Assumption 4 hold point-wise for all $\theta$ and $\omega$? I believe the expectation is over the random noise for estimating $g_k$ and $h_k$, which still leaves out $\theta$ and $\omega$.

  4. Do we expect the constants that appear in the paper, such as those in Assumptions 1-4 and in Proposition 3, to be dimension-dependent in typical settings?

  5. Equation (8) seems to have a typo; why are you using the expectation notation?

  6. Typo in the first line of Equation (9): I believe $\nu$ should be $\nu'$.

Claims and Evidence

The claims seem clear and well-supported.

Methods and Evaluation Criteria

Not applicable since this is a theory paper.

Theoretical Claims

I did not go over the details of the proof but the overall claims seem to be consistent with the literature and the proof technique seems sound from a high level.

Experimental Design and Analysis

Not applicable.

Supplementary Material

I only looked at the overall proof strategy and did not review the details.

Relation to Prior Literature

This paper contributes to a long line of work of applications of the mean-field Langevin dynamics for optimization in the space of probability distributions. As discussed in the paper, this is (to my knowledge) the first discrete-time convergence guarantee for mean-field Langevin descent-ascent for solving distributional min-max problems. Both the algorithm and the problem are of significant interest to the community.

Essential References Not Discussed

Most essential references are discussed in the paper. I think the authors can also discuss [1] where the mean-field Langevin algorithm is used for optimization over signed measures. Specifically, that paper contains ideas for going beyond pessimistic LSI constant estimates which can be useful for the results here as well, and similar to this paper, they also need a two-timescales approach for their analysis.

[1] G. Wang, A. Mousavi-Hosseini, L. Chizat. "Mean-Field Langevin Dynamics for Signed Measures via a Bilevel Approach." NeurIPS 2024.

Other Strengths and Weaknesses

Strengths:

Solving minmax optimization on the space of distributions is a fundamental problem, and it is nice to have discrete-time convergence guarantees for the mean-field Langevin descent-ascent algorithm. Also, the paper is mostly well-written and easy to read.

Weaknesses:

  • A main concern for me is that there is almost no discussion of the role of parameters other than $\epsilon$ in the convergence rate. A major weakness of this type of mean-field Langevin analysis is that convergence requires $\tau$ to be at least linearly small in the ambient dimension $d$. When plugged into the pessimistic LSI bound, this implies an $\alpha$ that is exponentially small in $d$, and thus a convergence rate that is exponentially large in $d$. This is a drawback of this type of analysis and not a weakness of this paper in particular, but I think it is better to be explicitly discussed.

  • Similarly, the role of other parameters is not made explicit. For example, is it better to have $\eta_1 \gg \eta_2$ or $\eta_1 \ll \eta_2$? How do quantities like $\alpha$, $\sigma^2$, and $\zeta$ enter the final convergence rate? While these questions might be answered by following certain quantities in the appendix, it would be nice to have summaries of the convergence rate in the main text.

  • The convergence analysis yields a bound on $\mathcal{L}$ and the Wasserstein distance to $\mu^*$ and $\nu^*$. Can such bounds be turned into a bound on the suboptimality $E(\mu_K,\nu_K) - E(\mu^*,\nu^*)$? I am asking this in particular since bounding this suboptimality is possible for distributional minimization problems with mean-field Langevin.

  • Additional suggestions are discussed below.

Other Comments or Suggestions

Please see below.

Author Response

Thank you for your thoughtful and detailed feedback! Below, we address the comments and questions point-by-point.

Pessimistic LSI bound: We will explicitly comment on the weakness of this type of analysis, and we are trying to overcome this in our ongoing research.

Explicit dependence on other parameters: In the revised version, we will include a dedicated paragraph discussing the roles of parameters and here we provide a sketch.

On dimension $d$ and regularization parameters: We agree that a more explicit discussion of the role of parameters in the convergence rate would be beneficial. Viewing $\tau,\eta_1,\eta_2$ as small numbers, in Appendix A.3 on page 11, the remainders $r_{g2}, r_{h2}$ due to the second moment are $\max\{O(1),\tau d\}$; similarly, $r_{g3}, r_{h3}$ are $\max\{O(1),\tau^{3/2}d^{3/2}\}$. Substituting them into $\Gamma_0, \Gamma_1, \Gamma_2$, we get $\Gamma_0=\max\{O(1), \tau d,\tau^{2}d^{2},\frac{d}{\alpha^{1/2}\tau}\}$ and $\Gamma_{1},\Gamma_{2}=\max\{O(1),\tau d,\tau^2d^2\}$. Substituting these into $R_1$ in (10), we get $R_1=O(\frac{d^2\eta_1}{\tau^3\alpha^3})$. Setting $R_1=\epsilon$ and choosing $\eta_1= O(\frac{\epsilon \tau^3 \alpha^3}{d^2})$, we obtain a sample complexity $K=O(\frac{d^2}{\epsilon \tau^4 \alpha^4}\log \frac{1}{\epsilon})$. Compared with the sample complexity $K=O(\frac{d^2}{\epsilon \tau^2 \alpha^2}\log \frac{1}{\epsilon})$ in [1] for MFLD on minimization problems, the higher orders in $\tau, \alpha$ arise because MFL-SDA is a two-timescale algorithm: the overall complexity depends on the slower descent step.
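The scaling above can be double-checked with elementary arithmetic. The sketch below hard-codes the stated complexity form $K=O(\frac{d^2}{\epsilon\tau^4\alpha^4}\log\frac{1}{\epsilon})$ with constants suppressed and arbitrary illustrative values, purely to read off how $K$ scales in each parameter:

```python
import math

# Assumed complexity form from the discussion above (constants suppressed):
#   K(d, tau, alpha, eps) = d^2 / (eps * tau^4 * alpha^4) * log(1/eps)
def K(d, tau, alpha, eps):
    return d**2 / (eps * tau**4 * alpha**4) * math.log(1 / eps)

base = K(10, 0.5, 0.1, 1e-2)
assert abs(K(20, 0.5, 0.1, 1e-2) / base - 4.0) < 1e-9    # doubling d: 4x more samples
assert abs(K(10, 0.25, 0.1, 1e-2) / base - 16.0) < 1e-9  # halving tau: 16x more samples
assert abs(K(10, 0.5, 0.05, 1e-2) / base - 16.0) < 1e-9  # halving alpha: 16x more samples
```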

On other parameters: The variances $\zeta,\sigma^2$ only affect the $O(1)$ term in $\Gamma_i$. The step sizes $\eta_1,\eta_2$ correspond to the fast-descent and fast-ascent regimes similar to [3], and thus their choice depends on the instance and the user's emphasis on the descent/ascent part.

Another standard of suboptimality: Our problem is to find a saddle point rather than a maximum (minimum); hence $E(\mu_K,\nu_K)-E(\mu^*,\nu^*)$ may not apply to our analysis.

Addressing References Not Discussed:

Thank you for pointing this out! We will cite this paper in the revision and investigate if their Theorem 5.2 can inspire a better LSI bound for our problem.

Addressing Questions:

A1. Thank you for catching this: Talagrand's inequality should be $W_2^2(\mu_k,\mu^*)\leq \frac{2}{\alpha}\mathrm{KL}(\mu_k\|\mu^*)$.
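As a hedged sanity check of this Talagrand inequality (our illustration, not part of the rebuttal): for a one-dimensional Gaussian target $\mu^*=N(0,1/\alpha)$, which satisfies LSI($\alpha$), and a mean-shifted $\mu_k=N(m,1/\alpha)$, both sides are available in closed form and coincide:

```python
# 1-D Gaussian example: mu_star = N(0, 1/alpha) satisfies LSI(alpha),
# and mu_k = N(m, 1/alpha) is a pure mean shift. Talagrand's inequality:
#   W_2^2(mu_k, mu_star) <= (2/alpha) * KL(mu_k || mu_star).
alpha, m = 2.0, 1.5
var = 1.0 / alpha

kl = m**2 / (2 * var)   # KL between equal-variance Gaussians = m^2 / (2 var)
w2_sq = m**2            # squared W_2 distance reduces to the squared mean shift
bound = (2 / alpha) * kl

assert w2_sq <= bound + 1e-12
print(w2_sq, bound)  # prints 2.25 2.25 -- equality for this mean-shift family
```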

A2. $\lambda$ corresponds to the Lyapunov function similar to [2], [3], and it is not an explicit hyperparameter in our algorithm.

A3. Yes, it holds point-wise for all $\theta$ and $\omega$.

A4. See our response on the dimension above.

A5, A6. Thank you for noticing this. We will correct the notation.

We thank you very much again for the constructive feedback and helpful suggestions!

Reference:

[1] Nitanda, A., Wu, D., & Suzuki, T. (2022). Convex analysis of the mean field Langevin dynamics. In International Conference on Artificial Intelligence and Statistics (pp. 9741-9757). PMLR.

[2] Lu, Y. (2023). Two-scale gradient descent ascent dynamics finds mixed Nash equilibria of continuous games: A mean-field perspective. In International Conference on Machine Learning (pp. 22790-22811). PMLR.

[3] Yang, J., Kiyavash, N., & He, N. (2020). Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems. arXiv preprint arXiv:2002.09621.

Review (Rating: 3)

The paper analyzes a Langevin-type scheme for finding equilibria in mean-field games under convexity and smoothness assumptions. The rates obtained scale as $\widetilde{O}(1/\varepsilon)$, which agrees with the rate in Euclidean space. An extension to stochastic gradients is also considered.

Questions for the Authors

What is the dependence of the result on the dimension, and the condition numbers (Hessian bounds, etc.)? How does it compare to prior work?

It is a bit strange that third-moment boundedness is needed for the stochastic gradient oracle, when normally it is only the second moment that is required. Can the authors comment if they believe this requirement is fundamental?

Although the paper attains a sharp rate, this is for the mean-field version of the algorithm. It is not clear whether this type of analysis can be carried out using this framework for a finite-particle algorithm (accounting for the additional error of the particles). Can the authors comment?

Claims and Evidence

The main selling point of this work is the $\widetilde{O}(1/\varepsilon)$ rate, under standard assumptions on the boundedness of (derivatives of) differentials of the objective. This result is state-of-the-art compared to prior work (which generally has higher complexity). There is also an extension to stochastic gradient oracles when the oracle has bounded error in the second and third moments. The authors provide solid proofs for their claims.

Methods and Evaluation Criteria

This is not applicable to this paper.

Theoretical Claims

The results make sense and the proofs appear rigorous; I checked all the results (without verifying all technical details).

Experimental Design and Analysis

The paper is theoretical in nature and therefore does not contain an experimental component. However, some ramifications for typical applications are discussed.

Supplementary Material

I skimmed the proof in the appendix. However, it is rather technical and I did not have the opportunity to review all the details.

Relation to Prior Literature

The paper is related to prior work on mean-field games, in particular to Langevin-type algorithms for computing equilibrium points. The main claim of this paper is that it improves the rates compared to those prior works (at least with respect to the inverse accuracy dependence). Earlier work had a super-linear dependence on ε\varepsilon or required an inner-outer loop complexity.

Essential References Not Discussed

As far as I am aware, the most relevant references have been covered.

Other Strengths and Weaknesses

Strengths:

I believe this work is impactful as it makes a significant improvement to the rate estimate for an important statistical problem.

The rates are examined in the context of various applications, with interpretable and clear results.

Weaknesses:

The result is mainly theoretical, as in general a particle algorithm is needed for implementability (which complicates the rate). This is agreed upon by the authors; it is unclear to me whether the analysis scheme in this paper can be preserved after particle discretization.

It would be helpful if the authors included other important parameters in their rate estimates, such as the dimensions.

Other Comments or Suggestions

Line 143: “We set the scaling factor the weight decay the regularization” does not make sense, please revise.

Equation 9 exceeds the column spacing. This occurs in multiple other places; please fix this.

Author Response

Thank you very much for your thorough and constructive feedback. Below, we address the main points raised.

Addressing Weakness:

Particle Discretization: For this problem, we have verified that our method is feasible in the particle setting. As an illustration, we briefly summarize the adjustments needed in our proof technique under the particle approximation setting:

We set the joint distributions $\boldsymbol\mu,\boldsymbol\nu$ over $N$ finite particles $\boldsymbol\theta=[\theta^i]^N_{i=1}$ and $\boldsymbol\omega = [\omega^i]^N_{i=1}$. In this case, the LSI has the form adapted from [1, Lemma 7]:

$$\frac{\tau^2}{N} \sum^N_{i=1} \mathbb{E}_{\boldsymbol\nu_k}\big[\|h_i(\boldsymbol\omega_k)-\nabla \log \boldsymbol\rho_k(\boldsymbol\omega_k)\|^2_2\big]\geq 2\alpha \tau\, \mathcal{L}_2^N(\boldsymbol\mu_{k+1},\boldsymbol\nu_k)+O\Big(\frac{1}{N}\Big),$$

where $\boldsymbol\mu_k$ and $\boldsymbol\nu_k$ denote the distributions of $\boldsymbol\theta$ and $\boldsymbol\omega$ at the $k$-th iteration, respectively; $h_i$ denotes the partial derivative w.r.t. the $i$-th particle; $\mathcal{L}_2^N$ denotes the Lyapunov function for the counterpart on the product space; $\boldsymbol\rho$ denotes the Gibbs distribution; and the $O(\frac{1}{N})$ term comes from the propagation of chaos. The other log-Sobolev inequalities used in our setting have similar forms. Thereby, most derivations in the manuscript involving Taylor expansions can be extended to analyze the joint distribution of particles, with an additional approximation error $O(1/N)$ due to weak particle interactions. The per-step improvement would be of the form

$$\mathcal{L}(\boldsymbol\mu_{k+1},\boldsymbol\nu_{k+1})\leq (1-2\eta_1\tau\alpha)\,\mathcal{L}(\boldsymbol\mu_{k},\boldsymbol\nu_{k}) + O(\eta_1^2) + O\Big(\frac{\eta_1}{N}\Big),$$

where the $O(\frac{\eta_1}{N})$ term comes from the propagation of chaos. Building on our current line of thought, we believe this approach can be extended to the stochastic gradient setting while maintaining the same sample complexity order as well. However, due to the extensive computations involved, we will present a complete proof of our results in an upcoming extended journal version.

Parameters: Due to the space limit, please refer to our response to Reviewer 5VGF for a detailed analysis of how the sample complexity in Theorems 1 and 2 depends on various parameters, including the dimension $d$. Notably, the sample complexity of MFL-AG in [2] is $O(\epsilon^{-O(1/\alpha)})$, which is exponential in the LSI constant $\alpha$, hence in $d$ under their conservative bound $O(1/(2+d/2))$ on $\alpha$, whereas our bound does not suffer from this exponential dependence. The Hessian bound $C_1$ appears in the bias as $\Gamma_i=O(C_1)$. As for the condition number, adopting the notion of effective condition number in [3, Theorem 2.1], our bound has a similar dependence in the case of zero-sum games.

Addressing Questions:

Line 143 and Equation (9): Thank you for pointing this out. We will correct the issues in the updated version of the manuscript.

Boundedness of the third moment: This requirement arises from the presence of $\mathbb{E}_{\mu_k}[\|g_k\|_2^3]$ in our remainder term. Therefore, in the stochastic gradient part, to ensure that this remainder term remains uniformly bounded, we need to impose a bound on the third-order moment. In fact, [1] assumed that the stochastic gradient is uniformly bounded with respect to any $\omega$ and $\theta$, which is stronger than our assumption.

Thank you again for your valuable feedback! We will correct the typos and formatting issues as you suggested.

References

[1] Suzuki, T., Wu, D., & Nitanda, A. (2023). Convergence of mean-field Langevin dynamics: time-space discretization, stochastic gradient, and variance reduction. Advances in Neural Information Processing Systems, 36, 15545-15577.

[2] Kim, J., Yamamoto, K., Oko, K., Yang, Z., & Suzuki, T. (2023). Symmetric mean-field Langevin dynamics for distributional minimax problems. arXiv preprint arXiv:2312.01127.

[3] Lu, Y. (2023). Two-scale gradient descent ascent dynamics finds mixed Nash equilibria of continuous games: A mean-field perspective. International Conference on Machine Learning. PMLR.

审稿人评论

I thank the authors for their response. Although the authors have indicated that a particle discretization analysis is forthcoming in an extended edition of this paper, I believe such a result is integral to this type of theoretical analysis. As a result, I remain lukewarm on the current draft and will opt to maintain my score.

作者评论

We thank the reviewer for their thoughtful feedback. We fully agree that a particle discretization analysis is important for this line of theoretical work.

To give a full picture of the particle case, we describe the detailed setting as follows:

We represent the distributions $\mu,\nu$ using $N$ finite particles $\boldsymbol\theta=[\theta^i]^N_{i=1}$ and $\boldsymbol\omega=[\omega^i]^N_{i=1}$, given by $\mu_{\boldsymbol\theta}=\frac{1}{N}\sum_{i=1}^N \delta_{\theta^i}$ and $\nu_{\boldsymbol\omega}=\frac{1}{N}\sum_{i=1}^N \delta_{\omega^i}$. In this case, the algorithm becomes:

For $k=1,2,\ldots,K-1$ do:

For all particles $i=1,2,\ldots,N$, sample $\xi_k^{\mu,i}\sim \mathcal N(0, I_d)$, $\xi_k^{\nu,i}\sim \mathcal N(0, I_d)$ and do:

$$\theta^i_{k+1} \leftarrow \theta^i_k - \eta_1\Big(\nabla \frac{\delta J}{\delta \mu}[\mu_{\boldsymbol\theta_k},\nu_{\boldsymbol\omega_k}](\theta^i_k)+\tau \theta^i_k\Big)+\sqrt{2\eta_1 \tau}\,\xi^{\mu,i}_k,$$

$$\omega^i_{k+1} \leftarrow \omega^i_{k} + \eta_2\Big(\nabla \frac{\delta J}{\delta \nu}[\mu_{\boldsymbol\theta_{k+1}},\nu_{\boldsymbol\omega_k}](\omega^i_k)-\tau \omega^i_k\Big)+\sqrt{2\eta_2\tau}\,\xi^{\nu,i}_k.$$
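To make the update concrete, here is a minimal runnable sketch of the particle scheme for the simple bilinear zero-sum game $J(\mu,\nu)=\mathbb E_{\theta\sim\mu}\mathbb E_{\omega\sim\nu}[\theta^\top\omega]$, for which $\nabla\frac{\delta J}{\delta \mu}(\theta)=\mathbb E_\nu[\omega]$ and $\nabla\frac{\delta J}{\delta \nu}(\omega)=\mathbb E_\mu[\theta]$. The choice of game, step sizes, and exact gradients (no minibatch noise) are our illustrative assumptions, not the paper's general setting:

```python
import numpy as np

# Particle MFL-SGDA sketch for the bilinear game J = E_mu E_nu [theta . omega].
# All constants are illustrative; gradients are exact here (no stochastic noise).
rng = np.random.default_rng(0)
d, N = 2, 500
tau, eta1, eta2, K = 1.0, 5e-3, 5e-3, 2000
theta = rng.normal(size=(N, d))       # particles representing mu
omega = rng.normal(size=(N, d))       # particles representing nu

for _ in range(K):
    # descent step: grad of dJ/dmu at theta^i is E_nu[omega] for this J
    g_mu = omega.mean(axis=0)
    theta = (theta - eta1 * (g_mu + tau * theta)
             + np.sqrt(2 * eta1 * tau) * rng.normal(size=(N, d)))
    # ascent step uses the freshly updated mu, as in the scheme above
    g_nu = theta.mean(axis=0)
    omega = (omega + eta2 * (g_nu - tau * omega)
             + np.sqrt(2 * eta2 * tau) * rng.normal(size=(N, d)))

# at the regularized mixed Nash equilibrium both particle clouds are
# centered Gaussians, so the empirical means should be near zero
print(np.linalg.norm(theta.mean(axis=0)), np.linalg.norm(omega.mean(axis=0)))
```

The empirical means settle near zero up to $O(1/\sqrt N)$ fluctuations, illustrating the finite-particle error discussed below.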

The main differences between the particle case and the mean-field case concern the following four aspects:

  1. Change of probability spaces: Although $\mu_{\boldsymbol\theta}=\frac{1}{N}\sum_{i=1}^N \delta_{\theta^i}$ and $\nu_{\boldsymbol\omega}=\frac{1}{N}\sum^N_{i=1} \delta_{\omega^i}$ can be viewed as mixtures of atomic measures on $\mathbb R^d$, what we actually track is the joint parameter space of $[\theta^i_k]^N_{i=1}$ and $[\omega^i_k]^N_{i=1}$, which is $\mathbb R^{Nd}$. Hence the per-step update in Lemma 1 becomes an iteration on the space $\mathcal P(\mathbb R^{Nd})$, which requires a subtler analysis.
  2. Change of Gibbs measures: As noted in point 1, the target measure no longer lives on the particle approximation but on the parameter space $\mathcal P(\mathbb R^{Nd})$; as in Kim et al. (2024), the definition of the Gibbs operator changes to
     $$\mathcal K^+[\mu]\propto \mu^{\otimes N}\exp\Big(-N\sum^N_{i=1}g_k^i\Big).$$
  3. Change of the log-Sobolev inequality: consequently, the LSI must also be adapted to this modified Gibbs measure.
  4. Self-correlation: during the update process, the correlation between one particle and itself cannot be omitted, and this correlation grows as $N$ decreases.

Hence, following our proof analysis, these are the parts we need to modify:

Boundedness of gradients: Since every Wasserstein gradient becomes a weighted sum of the gradients of the individual particles, e.g., eq. (38) becomes

$$\mathbb E_{\nu_k}[\|\omega_k\|^2_2]= \frac{1}{N}\sum^N_{i=1} \mathbb E_{\nu_k}[\|\omega^i_k\|^2_2],$$

and the iteration becomes

$$\mathbb E_{\nu_k}[\|\omega^i_k\|^2_2]=\mathbb E_{\nu_{k-1}\otimes\rho}\big[\|\omega^i_{k-1}+\eta_2 h_k+\sqrt{2\eta_2\tau}\,\xi_k^i\|^2_2\big],$$

where $h_k$ is as in the algorithm. Hence, all the parts that contain gradients will pick up a term depending on $N$.

Correspondence in the second-order variation: For example, in eq. (52), when $\mu,\nu$ are both particle measures, it becomes

$$\frac{1}{N}\sum_{i,j=1}^N \big(H(\omega_{k+1}^i,\omega_{k+1}^j)-H(\omega_{k+1}^i,\omega_k^j)-H(\omega_k^i,\omega_{k+1}^j)+H(\omega_k^i,\omega_k^j)\big).$$

When $i=j$, the dependence of a particle on itself introduces a bias term $O(\eta)$.

Failure of LSI: Due to the suboptimality of the Gibbs operators, the LSI in eq. (36) and eq. (64) no longer holds. However, [1] gave a solution to this case in Lemma 7, which takes the form

$$\frac{\tau^2}{N} \sum^N_{i=1} \mathbb E\big[\|h_{\omega_k}-\nabla \log \nu_{\omega_k}\|^2_2\big]\geq 2\alpha \tau \mathcal L_2(\mu_{\theta_{k+1}},\nu_{\omega_k})+O\Big(\frac{1}{N}\Big).$$

After these clarifications, most of our derivations, such as the Taylor expansions, remain valid once the mean-field case is replaced with the particle case. However, since every part of our proof needs to be rewritten in a form involving the particle number $N$, on top of the three main modifications above, this additional analysis is rather involved, potentially doubling or even tripling the length of the proof. For this reason, and to maintain clarity within the scope of a conference submission, we have chosen to defer this component to an extended version. We appreciate the reviewer's understanding and thoughtful consideration.

References

[1] Suzuki, T., Wu, D., & Nitanda, A. (2023). Convergence of mean-field Langevin dynamics: time-space discretization, stochastic gradient, and variance reduction. Advances in Neural Information Processing Systems, 36, 15545-15577.

最终决定

This paper presents a convergence analysis of the discrete-time mean-field Langevin stochastic gradient descent-ascent (MFL-SGDA) algorithm for solving distributional minimax optimization problems. The authors establish an $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ convergence rate under log-Sobolev conditions, which is nearly optimal. A key technical contribution is the direct discrete-time analysis without resorting to continuous-time PDE arguments.

The work is theoretically well-motivated and relevant to key applications, including GANs, zero-sum games, and robust learning. All reviewers leaned toward acceptance, with several increasing their scores after a detailed and thoughtful rebuttal.

That said, reviewer 5VGF pointed out a crucial issue: the dependence of convergence rate on parameters such as the dimension, regularization parameter, and LSI constant was suppressed in the initial submission. The authors responded with clear expressions, but this content should be more prominently discussed in the final version.

Overall, this paper advances the theoretical understanding of Langevin-type algorithms in minimax distributional settings and deserves to be accepted.