PaperHub

Rating: 6.5/10 (Poster; 4 reviewers; min 5, max 9, standard deviation 1.5)
Individual ratings: 5, 6, 9, 6
Confidence: 4.0 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.5
NeurIPS 2024

Last-Iterate Convergence for Generalized Frank-Wolfe in Monotone Variational Inequalities

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

This paper establishes a last-iterate convergence rate for a generalized Frank-Wolfe algorithm for solving monotone variational inequality problems.

Abstract

We study the convergence behavior of a generalized Frank-Wolfe algorithm in constrained (stochastic) monotone variational inequality (MVI) problems. In recent years, there have been numerous efforts to design algorithms for solving constrained MVI problems due to their connections with optimization, machine learning, and equilibrium computation in games. Most work in this domain has focused on extensions of simultaneous gradient play, with particular emphasis on understanding the convergence properties of extragradient and optimistic gradient methods. In contrast, we examine the performance of an algorithm from another well-known class of optimization algorithms: Frank-Wolfe. We show that a generalized variant of this algorithm achieves a fast $\mathcal{O}(T^{-1/2})$ last-iterate convergence rate in constrained MVI problems. By drawing connections between our generalized Frank-Wolfe algorithm and the well-known smoothed fictitious play (FP) from game theory, we also derive a finite-sample convergence rate for smoothed FP in zero-sum matrix games. Furthermore, we demonstrate that a stochastic variant of the generalized Frank-Wolfe algorithm for MVI problems also converges in a last-iterate sense, albeit at a slower $\mathcal{O}(T^{-1/6})$ convergence rate.
Keywords
Monotone variational inequalities · generalized Frank-Wolfe method · last-iterate convergence · smoothed fictitious play

Reviews and Discussion

Review (Rating: 5)

The paper presents a regularized version of the Frank-Wolfe algorithm for monotone variational inequalities, for which the authors prove an $\tilde{O}(T^{-1/2})$ convergence rate for the last iterate. In the stochastic case, using a variance reduction technique, the authors present an algorithm that enjoys $\tilde{O}(T^{-1/6})$ last-iterate convergence.

Strengths

  • The presentation and the analysis of the paper are mostly straightforward and easy to follow.
  • Using regularization and variance reduction is a simple idea but it appears to work quite well.

Weaknesses

  1. There are some major issues with the proofs in the paper, as follows:
  • In Line 808, second equation, $w_t$ does not appear. How would the authors fix this issue? It seems non-trivial to me, although in the easier case when $w_t=0$, the proof will work.
  • Lemma 4.1: The optimality condition here is usually considered very strong, i.e., there exists a point such that the gradient $=0$, especially when the function considered changes according to $x$.
  • The assumption that there exists $\bar{F} = \max \|F(x)\|_2$ seems unusual and strong, and the dependence of the convergence on $\bar{F}$ is not convincing to me.
  2. Empirically, there is a gap between extragradient and the proposed algorithm.

Questions

Could the authors address my questions mentioned above? I would be happy to increase my score if these concerns are addressed adequately.

Limitations

Yes

Author Response

Comment: In Line 808, second equation, $w_t$ does not appear. How would the authors fix this issue? It seems non-trivial to me, although in the easier case when $w_t=0$, the proof will work.

Response: We greatly appreciate the reviewer for carefully reading our work. The missing $w_t$ in Line 808 was indeed a typo. Next, we present a fix to this issue. We start with the decomposition of the term $E_1$ in Line 808:

$$E_1=(s_t-s_{t+1})^\top (F(x_{t+1})-F(x_t))+(x_t-s_t)^\top (F(x_{t+1})-F(x_t)). \qquad (1)$$

For the first term on the right-hand side of Eq. (1), we have by Assumption 4.1 and Lemma 4.1 that

$$(s_t-s_{t+1})^\top (F(x_{t+1})-F(x_t))\leq \|s_t-s_{t+1}\|\,\|F(x_{t+1})-F(x_t)\| \leq \frac{L^2}{\tau \sigma_f}\|x_{t+1}-x_t\|^2. \qquad (2)$$

This is the same as what we did in Line 808 (from the first inequality to the third inequality).

The typo was with our approach in bounding the second term on the right-hand side of Eq. (1). Next, we present a different approach to bound it, which requires the following additional but mild assumption.

Assumption: The Jacobian matrix $J(\cdot)$ of the operator $F(\cdot)$ is Lipschitz continuous.

Since $F(\cdot)$ is Lipschitz continuous (Assumption 4.1), its Jacobian matrix is well-defined almost everywhere. This assumption imposes a mild smoothness condition on the Jacobian matrix, which is automatically satisfied in the zero-sum game setting, where the operator $F(\cdot)$ is linear. Note that the monotonicity of $F(x)$ implies that $J(x)+J(x)^\top$ is a positive semidefinite matrix for any $x\in\mathcal{X}$.

Denote the Lipschitz constant of $J(\cdot)$ as $L_J$. To proceed, in view of the second term on the right-hand side of Eq. (1), for a given $v\in\mathbb{R}^d$, consider the function $g_v(x)=v^\top F(x)$. Since
$$\|\nabla g_v(x_1)-\nabla g_v(x_2)\|=\|(J(x_1)-J(x_2))^\top v\|\leq L_J\|v\|\,\|x_1-x_2\|,$$
the function $g_v(x)$ is $L_J\|v\|$-smooth with respect to $x$. Therefore, using the descent lemma, we have for any $x_1,x_2\in\mathcal{X}$:
$$g_v(x_2)-g_v(x_1)\leq (x_2-x_1)^\top J(x_1)^\top v+\frac{L_J\|v\|}{2}\|x_1-x_2\|^2.$$
Identifying $v=x_t-s_t$, $x_1=x_t$, and $x_2=x_{t+1}$ in the above inequality, the second term on the right-hand side of Eq. (1) can be bounded as follows:
$$\begin{aligned}
(x_t-s_t)^\top (F(x_{t+1})-F(x_t)) &\leq (x_{t+1}-x_t)^\top J(x_t)^\top (x_t-s_t)+\frac{L_J\|x_t-s_t\|\,\|x_{t+1}-x_t\|^2}{2}\\
&= -\alpha_t(x_t-s_t+w_t)^\top J(x_t)^\top (x_t-s_t)+\frac{L_J\|x_t-s_t\|\,\|x_{t+1}-x_t\|^2}{2}\\
&\leq -\alpha_t(x_t-s_t)^\top \left[\frac{J(x_t)+J(x_t)^\top}{2}\right](x_t-s_t)-\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\frac{2L_JD_{\mathcal{X}}}{2}\|x_{t+1}-x_t\|^2\\
&\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\frac{2L_JD_{\mathcal{X}}}{2}\|x_{t+1}-x_t\|^2, \qquad (3)
\end{aligned}$$

where the first equality follows from the update equation for $x_t$, the second inequality follows from $\|x_t-s_t\|\leq \|x_t\|+\|s_t\|\leq 2D_{\mathcal{X}}$ (recall that $D_{\mathcal{X}}$ is the radius of the feasible set), and the last inequality follows from $J(x_t)+J(x_t)^\top$ being a positive semidefinite matrix (due to the monotonicity of $F(\cdot)$).
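The positive-semidefiniteness fact invoked here can be checked numerically; the following is a small illustrative sketch (the matrix $A$ is a hypothetical example, not taken from the paper): for a linear operator $F(x)=Ax$, monotonicity is equivalent to $A+A^\top \succeq 0$, and the bilinear zero-sum-game operator (skew-symmetric $A$) is exactly the boundary case.

```python
import numpy as np

# For a linear operator F(x) = A x, monotonicity <F(x) - F(y), x - y> >= 0
# for all x, y is equivalent to A + A^T being positive semidefinite.
# A bilinear zero-sum-game operator has a skew-symmetric Jacobian, the
# boundary case where J(x) + J(x)^T = 0.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])       # hypothetical skew-symmetric example
S = A + A.T                       # J(x) + J(x)^T for the constant Jacobian A
eigvals = np.linalg.eigvalsh(S)   # all eigenvalues should be >= 0
```

For this skew-symmetric example, $S$ is the zero matrix, so the quadratic term dropped in the last inequality vanishes identically.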

Comment

Comment: Lemma 4.1: The optimality condition here is usually considered very strong, i.e., there exists a point such that the gradient $=0$, especially when the function considered changes according to $x$.

Response: To clarify, the optimality condition is actually a natural consequence of our analysis rather than an assumption. To elaborate, note that the function $s^\top F(x)+\tau f(s)$, as a function of $s$, is $\tau \sigma_f$-strongly convex uniformly for all $x\in\mathcal{X}$. The strong convexity comes from the regularizer $\tau f(s)$ and is independent of $x$. Therefore, since the optimization problem $\arg\min_{s\in\mathcal{X}}\{s^\top F(x)+\tau f(s)\}$ of finding the Frank-Wolfe direction is a strongly convex problem on a convex and compact feasible set, there is a unique global minimizer. Moreover, since $f(\cdot)$ is chosen such that $\lim_{s\rightarrow\partial \mathcal{X}}\|\nabla f(s)\|=+\infty$ (cf. Condition 4.1), the optimal solution of $\arg\min_{s\in\mathcal{X}}\{s^\top F(x)+\tau f(s)\}$ must lie in the interior of the feasible set $\mathcal{X}$ (otherwise we can easily argue a contradiction), which leads to the first-order optimality condition stated at the beginning of the proof of Lemma 4.1. In the next version, we will provide a more detailed illustration in the proof of Lemma 4.1 to support stating the optimality conditions this way.
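For a concrete instance of this regularized subproblem (an illustrative sketch only, assuming the entropy regularizer $f(s)=\sum_i s_i\log s_i$ on the probability simplex; the numeric inputs are hypothetical, not from the paper), the minimizer has the closed-form softmin $s_i \propto \exp(-F_i(x)/\tau)$, which lies strictly in the interior of the simplex:

```python
import numpy as np

def regularized_fw_direction(F_x, tau):
    """argmin over the simplex of s^T F(x) + tau * sum_i s_i log s_i.

    With the entropy regularizer, the unique minimizer is the softmin
    s_i proportional to exp(-F_i(x)/tau); it is strictly positive in every
    coordinate, so the interior first-order optimality condition applies.
    """
    logits = -np.asarray(F_x, dtype=float) / tau
    logits -= logits.max()        # shift for numerical stability
    s = np.exp(logits)
    return s / s.sum()

# Hypothetical operator value: the direction concentrates on the
# coordinate where F(x) is smallest as tau -> 0 (softmin -> hardmin).
s = regularized_fw_direction([1.0, 0.2, 0.5], tau=0.1)
```

As $\tau \to 0$ the softmin approaches the linear-minimization (hardmin) direction of classical Frank-Wolfe, which is the reviewer's LMO point discussed elsewhere in the thread.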

Comment: The assumption that there exists $\bar{F}=\max\|F(x)\|_2$ seems unusual and strong, and the dependence of the convergence on $\bar{F}$ is not convincing to me.

Response: To clarify, this is also not an assumption but a natural consequence of our analysis. Since the feasible set $\mathcal{X}$ is compact and the function $F(\cdot)$ is Lipschitz continuous (hence continuous), the quantity $\bar{F}:=\max_{x\in\mathcal{X}}\|F(x)\|$ is well-defined and finite according to the extreme value theorem. We will clarify this point in the next version of this work. In our convergence bound, $\bar{F}$ only appears as a constant, hence does not impact the overall convergence rate.

Comment: Empirically, there is a gap between extragradient and the proposed algorithm.

Response: In Figure 1, it is not completely clear which algorithm has the better rate of convergence, as Frank-Wolfe seems to decay faster in the beginning while EG has a better convergence rate afterward. Theoretically, both generalized Frank-Wolfe and EG have a comparable $O(1/\sqrt{T})$ rate of convergence. In the stochastic setting, EG fails to converge while our variant of Frank-Wolfe (i.e., Algorithm 3) has provable last-iterate convergence.

That being said, the goal of this paper was not to propose an algorithm that outperforms existing benchmarks. In recent years, existing studies of monotone variational inequalities have primarily focused on gradient-based algorithms such as EG and OG. On the other hand, Frank-Wolfe is also a natural algorithm (but is, to some extent, overlooked), and has a nice connection with fictitious play in zero-sum games. We study this algorithm and show that with a simple modification (that is, adding a regularizer), the convergence rate of Frank-Wolfe matches with that of EG and OG in the deterministic setting and is provably convergent in the stochastic setting.

Comment

Combining Eqs. (2) and (3) with Eq. (1), and assuming without loss of generality that the regularization parameter $\tau$ is chosen to be less than $1$, we have

$$\begin{aligned}
E_1&=(s_t-s_{t+1})^\top (F(x_{t+1})-F(x_t))+(x_t-s_t)^\top (F(x_{t+1})-F(x_t))\\
&\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\left(\frac{2L_JD_{\mathcal{X}}}{2}+\frac{L^2}{\tau \sigma_f}\right)\|x_{t+1}-x_t\|^2\\
&\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\left(\frac{2L_JD_{\mathcal{X}}}{2}+\frac{L^2}{\sigma_f}\right)\frac{1}{\tau}\|x_{t+1}-x_t\|^2\\
&= -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+ \frac{M}{\tau}\|x_{t+1}-x_t\|^2, \qquad (4)
\end{aligned}$$

where we denote $M=\frac{2L_JD_{\mathcal{X}}}{2}+\frac{L^2}{\sigma_f}$ in Eq. (4) for simplicity of notation. To proceed, using the same line of analysis as we did from the third inequality to the last inequality in Line 808, we have

$$\begin{aligned}
E_1 &\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\frac{M}{\tau}\|x_{t+1}-x_t\|^2\\
&\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\frac{8\alpha_t^2MD_{\mathcal{X}}^2}{\tau} +\frac{2M\alpha_t^2}{\tau}\|w_t\|^2.
\end{aligned}$$

In view of the previous inequality, the first term on the right-hand side is mean zero due to $\{w_t\}$ being a martingale difference sequence (cf. Assumption 4.2), and the rest of the terms are the same (up to a multiplicative constant that is independent of $\tau$) as the terms on the right-hand side of the last inequality in Line 808. Following the same analysis starting from Line 809, the overall finite-time bound (with the same rate of convergence) can be reproduced.

In summary, the convergence analysis for Algorithm 2 can be reproduced with the additional assumption that the Jacobian matrix of $F(\cdot)$ is Lipschitz continuous. The analysis of Algorithm 1 for zero-sum games remains intact since the operator $F(\cdot)$ in that case is linear, hence the Jacobian is a constant, which is automatically Lipschitz. The analysis for Algorithm 3 remains intact as it does not require the additional assumption.

It is an interesting future direction to investigate whether the additional Lipschitz continuity assumption on the Jacobian matrix is necessary for the convergence of Algorithm 2. Intuitively, when there is stochasticity in the algorithm, we need some form of smoothness in the update to control the stochastic error. For Algorithm 3, this is achieved through the algorithm design (by creating a timescale separation between the variance-reduced stochastic estimate $y_t$ and the main iterate $x_t$). In Algorithm 2, since the noise is directly added to the update equation, we imposed the additional Lipschitz continuity assumption for the analysis.

Comment

We greatly appreciate the reviewer’s time and effort in evaluating our work. Please let us know if our response addresses the concerns raised, or if further clarification is needed.

Comment

I thank the authors for the response. The authors have addressed some parts of my concerns, and I have raised the score accordingly. However, I also think that the paper needs extra work to address the proof gap and the new assumption, as well as justifications for existing conditions/assumptions.

Comment

We are glad to hear that our response addresses some of the reviewer's concerns, and we greatly appreciate the reviewer for raising the score.

The new proof has been presented in full detail in the previous response, and it will be included in the next version of this work. Next, we would like to provide justification for our newly imposed assumption, which does not undermine the contributions of this work. First, it does not imply any strong curvature properties (such as strong monotonicity of the operator or strong convexity of the feasible set). Moreover, our original motivation for allowing the update to be noisy in Algorithm 2 is that it can directly be used to obtain the convergence bound of smoothed fictitious play, where the newly imposed assumption is automatically satisfied due to the linear structure of the operator $F(\cdot)$. In both the completely deterministic setting (where $w_t=0$) and the more challenging stochastic setting (where $F(x_t)$ is noisy), our original approaches work.

Review (Rating: 6)

This paper focuses on solving monotone variational inequalities (MVIs). To address this problem, the authors propose a new algorithm, which is a generalized variant of the Frank-Wolfe method. They also establish $O(T^{-1/2})$ last-iterate convergence, which matches the best-known results in this context. Additionally, they present $O(T^{-1/6})$ last-iterate convergence in the stochastic setting, which is, to their knowledge, the first last-iterate convergence result for solving constrained stochastic MVI problems.

Strengths

  • The paper is well-written and easy to follow.

  • The authors establish the first last-iterate convergence result for solving constrained stochastic MVI problems.

  • They demonstrate the last-iterate convergence result for the proposed generalized Frank-Wolfe method. Instead of using computer-aided proofs such as SOS and performance estimation, their analysis primarily relies on the Lyapunov argument.

Weaknesses

See the Questions part.

Questions

  • From my understanding, the key idea of the Frank-Wolfe method lies in the linear minimization oracle (LMO), which has been shown to be more efficient in computation [1]. In the proposed generalized Frank-Wolfe method, the additional regularizer disrupts the linear structure of the minimization problem. If we choose the regularizer to be a quadratic function, then line 3 of your algorithm reduces to the projection onto the feasible set $\mathcal{X}$. Therefore, I would like to know if there are any advantages compared to those projection-based algorithms, such as EG/OGDA? Would it be possible to provide an explanation from both the theoretical aspect and empirical performance?

  • In line 808 of your proof, the second equality is from the update of $x^{t+1}$. I think it lost the $w^t$ part. I am not sure whether this ignored part will influence the convergence rate you derived.

  • Could you please elaborate more on the introduction of the momentum step for gradient estimation $y^t$ in Algorithm 3? Is it essential for the convergence in the stochastic version? If so, could you provide the motivation for how it guarantees convergence?

[1] Combettes, Cyrille W., and Sebastian Pokutta. "Complexity of linear minimization and projection on some sets." Operations Research Letters 49.4 (2021): 565-571.

Limitations

The convergence analysis heavily depends on the regularizer to ensure the Lipschitz property of the solutions of the minimization problem (see Lemma 4.1). When $\tau=0$, it reduces to the FW algorithm, and its analysis is not provided. The authors also mention this in the conclusion.

Comment

Comment: Could you please elaborate more on the introduction of the momentum step for gradient estimation $y_t$ in Algorithm 3? Is it essential for the convergence in the stochastic version? If so, could you provide the motivation for how it guarantees convergence?

Response: Algorithm 3 Line 3 is indeed a key step in the algorithm design and is essential for the convergence analysis. In the following, we first consider the case without the variance-reduced gradient estimation step, illustrate the potential issues, and then elaborate on why introducing Algorithm 3 Line 3 addresses this issue.

Suppose that we directly use the noisy estimate $F(x_t)+z_t$ in Algorithm 3 Line 4 to compute the Frank-Wolfe direction (instead of going through Algorithm 3 Line 3 to construct the variance-reduced estimate $y_t$). Note that, despite replacing the hardmin with a softmin (by introducing the regularizer), the algorithm is extremely sensitive to the noise $z_t$ because the Frank-Wolfe direction computed from the accurate $F(x_t)$ and the noisy version $F(x_t)+z_t$ could be completely different. As a result, due to the lack of control on the noise, there is no convergence of $\{x_t\}$.

Now, consider Algorithm 3 Line 3. Recall that we created a timescale separation between $y_t$ and $x_t$ by choosing $\beta\gg \alpha$. Therefore, from the perspective of $y_t$, $x_t$ is close to being stationary. For the purpose of illustration, and for understanding the update equation of $y_t$, suppose that $x_t$ is completely stationary (which corresponds to $\alpha=0$). In this case, we will drop the subscript $t$ and just write $x_t$ as $x$. Given a sequence of noisy estimates $(F(x)+z_i)_{0\leq i\leq t}$ of $F(x)$, by running Algorithm 3 Line 3, $y_t$ is essentially a convex combination (hence a weighted average) of all the historical noisy estimates $(F(x)+z_i)_{0\leq i\leq t}$. Therefore, $y_t$ is an unbiased estimate of $F(x)$ with a much smaller variance compared with $F(x)+z_t$. As a result, compared with directly using $F(x)+z_t$, the Frank-Wolfe direction computed from $y_t$ in Algorithm 3 Line 4 is a much more accurate estimate of the Frank-Wolfe direction computed from the accurate $F(x)$, which eventually leads to the convergence of Algorithm 3.

As a final remark, we assumed $x_t$ to be stationary from the perspective of $y_t$ in the above illustration, while both $x_t$ and $y_t$ move simultaneously (albeit at different rates) in Algorithm 3. Therefore, the proof consists of a rigorous two-timescale analysis, studying both the estimation error $\|y_t-F(x_t)\|^2$ and the gap of the Lyapunov function $V(\cdot)$ at the same time.
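The variance-reduction effect of the averaging step can be illustrated with a toy sketch (assuming a stationary $x$ and i.i.d. unit-variance Gaussian noise; the values of $\beta$, $d$, $T$, and $F(x)$ below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, beta = 3, 2000, 0.05
F_x = np.array([0.3, -0.1, 0.8])   # hypothetical fixed operator value F(x)

y = F_x + rng.normal(size=d)       # initialize from one noisy sample
for t in range(T):
    noisy = F_x + rng.normal(size=d)        # F(x) + z_t, unit-variance noise
    y = (1.0 - beta) * y + beta * noisy     # averaging step (sketch of Alg. 3, Line 3)

# In steady state each coordinate of y has variance about beta / (2 - beta),
# far below the unit variance of a single noisy sample F(x) + z_t.
err = np.linalg.norm(y - F_x)
```

With $\beta = 0.05$ the steady-state per-coordinate variance is roughly $0.026$, so `err` is much smaller than the typical error of any single sample, matching the "weighted average of historical estimates" intuition above.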

Comment: The convergence analysis heavily depends on the regularizer to ensure the Lipschitz property of the solutions of the minimization problem (see Lemma 4.1). When $\tau=0$, it reduces to the FW algorithm, and its analysis is not provided. The authors also mention this in the conclusion.

Response: The introduction of the regularizer is indeed crucial for the analysis. When $\tau=0$, even in the case of fictitious play for zero-sum games (which is a special case of our setup), establishing the $O(1/\sqrt{T})$ rate of convergence has been an open problem for more than 60 years. This is known as Karlin's weak conjecture [1]. Resolving this open problem (and extending it to Frank-Wolfe for solving general MVI problems) is the ultimate goal of this line of research.

[1] Karlin, S. (2003). Mathematical methods and theory in games, programming, and economics (Vol. 2). Courier Corporation.

Comment

Combining Eqs. (2) and (3) with Eq. (1), and assuming without loss of generality that the regularization parameter $\tau$ is chosen to be less than $1$, we have

$$\begin{aligned}
E_1&=(s_t-s_{t+1})^\top (F(x_{t+1})-F(x_t))+(x_t-s_t)^\top (F(x_{t+1})-F(x_t))\\
&\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\left(\frac{2L_JD_{\mathcal{X}}}{2}+\frac{L^2}{\tau \sigma_f}\right)\|x_{t+1}-x_t\|^2\\
&\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\left(\frac{2L_JD_{\mathcal{X}}}{2}+\frac{L^2}{\sigma_f}\right)\frac{1}{\tau}\|x_{t+1}-x_t\|^2\\
&= -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+ \frac{M}{\tau}\|x_{t+1}-x_t\|^2, \qquad (4)
\end{aligned}$$

where we denote $M=\frac{2L_JD_{\mathcal{X}}}{2}+\frac{L^2}{\sigma_f}$ in Eq. (4) for simplicity of notation. To proceed, using the same line of analysis as we did from the third inequality to the last inequality in Line 808, we have

$$\begin{aligned}
E_1 &\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\frac{M}{\tau}\|x_{t+1}-x_t\|^2\\
&\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\frac{8\alpha_t^2MD_{\mathcal{X}}^2}{\tau} +\frac{2M\alpha_t^2}{\tau}\|w_t\|^2.
\end{aligned}$$

In view of the previous inequality, the first term on the right-hand side is mean zero due to $\{w_t\}$ being a martingale difference sequence (cf. Assumption 4.2), and the rest of the terms are the same (up to a multiplicative constant that is independent of $\tau$) as the terms on the right-hand side of the last inequality in Line 808. Following the same analysis starting from Line 809, the overall finite-time bound (with the same rate of convergence) can be reproduced.

In summary, the convergence analysis for Algorithm 2 can be reproduced with the additional assumption that the Jacobian matrix of $F(\cdot)$ is Lipschitz continuous. The analysis of Algorithm 1 for zero-sum games remains intact since the operator $F(\cdot)$ in that case is linear, hence the Jacobian is a constant, which is automatically Lipschitz. The analysis for Algorithm 3 remains intact as it does not require the additional assumption.

It is an interesting future direction to investigate whether the additional Lipschitz continuity assumption on the Jacobian matrix is necessary for the convergence of Algorithm 2. Intuitively, when there is stochasticity in the algorithm, we need some form of smoothness in the update to control the stochastic error. For Algorithm 3, this is achieved through the algorithm design (by creating a timescale separation between the variance-reduced stochastic estimate $y_t$ and the main iterate $x_t$). In Algorithm 2, since the noise is directly added to the update equation, we imposed the additional Lipschitz continuity assumption for the analysis.

Comment

Thank you for the response. I appreciate the rebuttal, as it has effectively addressed some concerns, particularly regarding the gap in the proof. I will adjust the score accordingly. However, I believe more effort is needed to remove the newly introduced assumption.

Comment

We are glad to hear that our response addresses some of the reviewer's concerns, especially regarding the proof gap, and we greatly appreciate the reviewer for raising the score. Next, we would like to further justify our newly imposed assumption, which does not undermine the contributions of this work. First, it does not imply any strong curvature properties (such as strong monotonicity of the operator or strong convexity of the feasible set). Moreover, our original motivation for allowing the update to be noisy in Algorithm 2 is that it can directly be used to obtain the convergence bound of smoothed fictitious play, where the newly imposed assumption is automatically satisfied due to the linear structure of the operator $F(\cdot)$. In both the completely deterministic setting (where $w_t=0$) and the more challenging stochastic setting (where $F(x_t)$ is noisy), our original approaches work.

Author Response

Comment: From my understanding, the key idea of the Frank-Wolfe method lies in the linear minimization oracle (LMO) ...

Response: We agree with the reviewer that by adding a regularizer, one can no longer use an LMO to find the Frank-Wolfe direction. However, also due to the regularizer, the problem of finding the Frank-Wolfe direction becomes a strongly convex optimization problem, which can be solved efficiently, or even admits closed-form solutions (such as in zero-sum games). Therefore, the additional computational challenge in obtaining the Frank-Wolfe direction is not an issue.

Compared with related projection gradient-based algorithms, such as EG, in the deterministic setting, we do not see any clear advantage, as they both achieve the same rate of convergence and similar empirical performance (see Appendix E Figure 1). In the stochastic setting, our method has a clear advantage since it is provably convergent while EG is shown to be unstable in general (see reference [15] in our work). In the next version of this work, we will provide a more extensive numerical study to compare the performance of generalized Frank-Wolfe to EG and OG.

That being said, the goal of this paper was not to propose an algorithm that outperforms existing benchmarks. In recent years, existing studies of monotone variational inequalities have primarily focused on gradient-based algorithms such as EG and OG. On the other hand, Frank-Wolfe is also a natural algorithm (but is, to some extent, overlooked), and has a nice connection with fictitious play in zero-sum games. We study this algorithm and show that with a simple modification (that is, adding a regularizer), the convergence rate of Frank-Wolfe matches with that of EG and OG in the deterministic setting and is provably convergent in the stochastic setting.

Comment: In line 808 of your proof, the second equality is from the update of $x_{t+1}$. I think it lost the $w_t$ part. I am not sure whether this ignored part will influence the convergence rate you derived.

Response: We greatly appreciate the reviewer for carefully reading our work. The missing $w_t$ in Line 808 was indeed a typo. Next, we present a fix to this issue. We start with the decomposition of the term $E_1$ in Line 808:

$$E_1=(s_t-s_{t+1})^\top (F(x_{t+1})-F(x_t))+(x_t-s_t)^\top (F(x_{t+1})-F(x_t)). \qquad (1)$$

For the first term on the right-hand side of Eq. (1), we have by Assumption 4.1 and Lemma 4.1 that

$$(s_t-s_{t+1})^\top (F(x_{t+1})-F(x_t))\leq \|s_t-s_{t+1}\|\,\|F(x_{t+1})-F(x_t)\| \leq \frac{L^2}{\tau \sigma_f}\|x_{t+1}-x_t\|^2. \qquad (2)$$

This is the same as what we did in Line 808 (from the first inequality to the third inequality).

The typo was with our approach in bounding the second term on the right-hand side of Eq. (1). Next, we present a different approach to bound it, which requires the following additional but mild assumption.

Assumption: The Jacobian matrix $J(\cdot)$ of the operator $F(\cdot)$ is Lipschitz continuous.

Since $F(\cdot)$ is Lipschitz continuous (Assumption 4.1), its Jacobian matrix is well-defined almost everywhere. This assumption imposes a mild smoothness condition on the Jacobian matrix, which is automatically satisfied in the zero-sum game setting, where the operator $F(\cdot)$ is linear. Note that the monotonicity of $F(x)$ implies that $J(x)+J(x)^\top$ is a positive semidefinite matrix for any $x\in\mathcal{X}$.

Denote the Lipschitz constant of $J(\cdot)$ as $L_J$. To proceed, in view of the second term on the right-hand side of Eq. (1), for a given $v\in\mathbb{R}^d$, consider the function $g_v(x)=v^\top F(x)$. Since
$$\|\nabla g_v(x_1)-\nabla g_v(x_2)\|=\|(J(x_1)-J(x_2))^\top v\|\leq L_J\|v\|\,\|x_1-x_2\|,$$
the function $g_v(x)$ is $L_J\|v\|$-smooth with respect to $x$. Therefore, using the descent lemma, we have for any $x_1,x_2\in\mathcal{X}$:
$$g_v(x_2)-g_v(x_1)\leq (x_2-x_1)^\top J(x_1)^\top v+\frac{L_J\|v\|}{2}\|x_1-x_2\|^2.$$
Identifying $v=x_t-s_t$, $x_1=x_t$, and $x_2=x_{t+1}$ in the above inequality, the second term on the right-hand side of Eq. (1) can be bounded as follows:
$$\begin{aligned}
(x_t-s_t)^\top (F(x_{t+1})-F(x_t)) &\leq (x_{t+1}-x_t)^\top J(x_t)^\top (x_t-s_t)+\frac{L_J\|x_t-s_t\|\,\|x_{t+1}-x_t\|^2}{2}\\
&= -\alpha_t(x_t-s_t+w_t)^\top J(x_t)^\top (x_t-s_t)+\frac{L_J\|x_t-s_t\|\,\|x_{t+1}-x_t\|^2}{2}\\
&\leq -\alpha_t(x_t-s_t)^\top \left[\frac{J(x_t)+J(x_t)^\top}{2}\right](x_t-s_t)-\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\frac{2L_JD_{\mathcal{X}}}{2}\|x_{t+1}-x_t\|^2\\
&\leq -\alpha_t w_t^\top J(x_t)^\top (x_t-s_t)+\frac{2L_JD_{\mathcal{X}}}{2}\|x_{t+1}-x_t\|^2, \qquad (3)
\end{aligned}$$

where the first equality follows from the update equation for $x_t$, the second inequality follows from $\|x_t-s_t\|\leq \|x_t\|+\|s_t\|\leq 2D_{\mathcal{X}}$ (recall that $D_{\mathcal{X}}$ is the radius of the feasible set), and the last inequality follows from $J(x_t)+J(x_t)^\top$ being a positive semidefinite matrix (due to the monotonicity of $F(\cdot)$).

Comment

We greatly appreciate the reviewer’s time and effort in evaluating our work. Please let us know if our response addresses the concerns raised, or if further clarification is needed.

Review (Rating: 9)

This paper combines the Frank-Wolfe algorithm with an entropy regularization trick to propose a generalized Frank-Wolfe algorithm for solving MVI problems. The paper proves an $O(T^{-1/2})$ last-iterate convergence rate for the deterministic case, and it extends the algorithm to stochastic MVIs with an $O(T^{-1/6})$ last-iterate convergence rate.

Strengths

  1. I think it's a good idea to combine entropy regularization with FW algorithm.

  2. The paper is very well written and easy to follow. The intuition for the proposed algorithms is well explained. The assumptions are standard and clearly listed. The dependence of the convergence rates on everything is explicitly written out.

  3. Strong theoretical guarantees are provided in this paper. For example, it gives the first last-iterate convergence guarantee for constrained stochastic MVIs without strong curvature assumptions. It also gives the first finite-time analysis of smoothed FP.

  4. Numerical experiments are conducted to demonstrate the efficiency of the proposed algorithms.

Weaknesses

I think there are no apparent weaknesses. I suspect the analysis for the stochastic case might not be tight, leaving room for improvement.

Questions

Limitations

Yes.

Author Response

Comment: I think there are no apparent weaknesses. I suspect the analysis for the stochastic case might not be tight, leaving room for improvement.

Response: We greatly appreciate the reviewer acknowledging the contributions of our work. We agree with the reviewer that the convergence rate of $O(1/T^{1/6})$ in the stochastic setting is likely sub-optimal. Improving the convergence rate (by algorithm design or by analysis) is an immediate future direction of this work.

Review (Rating: 6)

This paper proposes a Frank-Wolfe algorithm for solving monotone VI problems in both deterministic and stochastic settings, proves that the first matches the optimal convergence rate, and provides last-iterate convergence guarantees for the second, which is a novelty when no curvature requirement is imposed on either the set or the operator.

Strengths

The analogy between the classical game-theory algorithm and Frank-Wolfe is simple yet creative, and fruitful for providing theoretical guarantees for the former. The operator-smoothing approach in the design of the stochastic algorithm is also elegant and effective. I consider this paper worth reading thanks to these theoretical findings. Numerical experiments are also provided, but they show no significant advantage over alternative approaches, so the contribution of the paper seems only theoretical.

Weaknesses

Despite the theoretical focus of the paper, it is recommended to make the experiments more substantial by considering other problems, at least games like the Burglar-Policeman matrix game, or some more advanced example of a VI, perhaps involving adversarial neural networks. Regarding the theory, I see no significant flaws.

Questions

Typos and notation related remarks:

  • page 1, line 19: \to is more convenient than \mapsto
  • page 2, sentence at line 91: "the algorithm has been shown to diverge" is unclear; you have just mentioned the algorithm you have proposed, and it does not diverge, so which one does?
  • page 3, line 106: in the phrase "in either the uncontrained regime [13] or under", "in" should be after "either", and "uncontrained" has a typo
  • page 4, line 146: subscript and superscript for A are confused
  • Condition 4.1 allows one to remove the constraint $s \in \mathcal{X}$ from the max in line 3 of Algorithm 2, at least in accordance with line 217
  • page 6, line 228: "make ... assumptions" or "impose ... requirements"

Besides, the "shadow" indicating the std of function's values from multiple random runs should be added to the plots with stochastic Frank-Wolfe method.

Limitations

The limitations are clear from reading.

Author Response

Comment: Despite the theoretical focus of the paper, it is recommended to make experiments more contentful by considering other problems ...

Response: We thank the reviewer for the suggestion. In the next version, we will include an additional numerical example to demonstrate the performance of the generalized Frank-Wolfe algorithm relative to other algorithms.

Comment: Typos and notation related remarks.

Response: We thank the reviewer for carefully reading our work. We will fix the typos and notation-related problems on pages 1, 3, 4, and 6, as well as in Condition 4.1, as pointed out by the reviewer. Regarding the sentence "the algorithm has been shown to diverge" in line 91 of page 2, it refers to the extragradient algorithm. We will clarify this in the next version of our work.

Comment: Besides, the "shadow" indicating the std of function's values from multiple random runs should be added to the plots with stochastic Frank-Wolfe method.

Response: We will plot the standard deviation for the stochastic Frank-Wolfe algorithm in our numerical simulations.

Comment

We greatly appreciate the reviewer’s time and effort in evaluating our work. Please let us know if our response addresses the concerns raised, or if further clarification is needed.

Comment

We thank all reviewers for carefully reading our work. In the following, we provide a point-by-point response to all reviewers' comments.

Final Decision

The reviewers are unanimous in favor of this submission.