PaperHub
Overall rating: 7.0/10 (Poster; 4 reviewers; min 6, max 8, std 0.7)
Individual ratings: 8, 6, 7, 7
Confidence: 3.0 · Correctness: 3.5 · Contribution: 3.5 · Presentation: 3.0
NeurIPS 2024

Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms

Links: OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

We prove that a broad class of algorithms that do not forget the past quickly, including OMWU and OFTRL, all suffer arbitrarily slow last-iterate convergence even in simple zero-sum games.

Abstract

Self-play via online learning is one of the premier ways to solve large-scale zero-sum games, both in theory and practice. Particularly popular algorithms include optimistic multiplicative weights update (OMWU) and optimistic gradient-descent-ascent (OGDA). While both algorithms enjoy $O(1/T)$ ergodic convergence to Nash equilibrium in two-player zero-sum games, OMWU offers several advantages, including logarithmic dependence on the size of the payoff matrix and $\tilde{O}(1/T)$ convergence to coarse correlated equilibria even in general-sum games. However, in terms of last-iterate convergence in two-player zero-sum games, an increasingly popular topic in this area, OGDA guarantees that the duality gap shrinks at a rate of $O(1/\sqrt{T})$, while the best existing last-iterate convergence for OMWU depends on some game-dependent constant that could be arbitrarily large. This begs the question: is this potentially slow last-iterate convergence an inherent disadvantage of OMWU, or is the current analysis too loose? Somewhat surprisingly, we show that the former is true. More generally, we prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue: for any arbitrarily small $\delta>0$, there exists a $2\times 2$ matrix game such that the algorithm admits a constant duality gap even after $1/\delta$ rounds. This class of algorithms includes OMWU and other standard optimistic follow-the-regularized-leader algorithms.
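For intuition, here is a minimal NumPy sketch (ours, not the paper's code) of the two dynamics on a 2x2 zero-sum game: OMWU implemented as OFTRL with the entropy regularizer, and single-sequence OGDA with Euclidean projection. The payoff matrix, step size, and horizon below are placeholders; the paper's hard instances are constructed in the paper itself.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                          # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()

def proj_simplex(v):
    # Euclidean projection onto the probability simplex (sort-based).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def duality_gap(A, x, y):
    # gap(x, y) = max_{y'} x^T A y' - min_{x'} x'^T A y for min-max x^T A y.
    return (x @ A).max() - (A @ y).min()

def run_omwu(A, eta=0.1, T=100_000):
    # OMWU as OFTRL with entropy: best-respond to cumulative + predicted losses.
    d1, d2 = A.shape
    Lx, Ly = np.zeros(d1), np.zeros(d2)      # cumulative loss vectors
    gx, gy = np.zeros(d1), np.zeros(d2)      # last losses = optimistic prediction
    for _ in range(T):
        x, y = softmax(-eta * (Lx + gx)), softmax(-eta * (Ly + gy))
        gx, gy = A @ y, -(A.T @ x)           # x minimizes, y maximizes x^T A y
        Lx += gx
        Ly += gy
    return duality_gap(A, x, y)

def run_ogda(A, eta=0.1, T=100_000):
    # Single-sequence optimistic gradient descent-ascent with projection.
    d1, d2 = A.shape
    x, y = np.ones(d1) / d1, np.ones(d2) / d2
    gx_prev, gy_prev = np.zeros(d1), np.zeros(d2)
    for _ in range(T):
        gx, gy = A @ y, -(A.T @ x)
        x = proj_simplex(x - eta * (2 * gx - gx_prev))
        y = proj_simplex(y - eta * (2 * gy - gy_prev))
        gx_prev, gy_prev = gx, gy
    return duality_gap(A, x, y)

A = np.array([[1.0, -1.0], [-1.0, 1.0]])     # placeholder game (matching pennies)
print("OMWU last-iterate gap:", run_omwu(A))
print("OGDA last-iterate gap:", run_ogda(A))
```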
Keywords
Last-Iterate Convergence · Zero-Sum Games · Optimistic Multiplicative Weights Update · Optimistic Gradient

Reviews and Discussion

Review (Rating: 8)

The present work studies the last-iterate convergence of learning algorithms in zero-sum games. Existing results have shown that although OMWU and OGDA both enjoy the $O(1/T)$ ergodic convergence rate, their best existing last-iterate convergence rates exhibit a nontrivial gap. This paper constructs a particular hard example showing that such a gap is due to a fundamental limitation of the OMWU algorithm, rather than the sub-optimality of existing analyses. From this hard example, the key insight is that fast last-iterate convergence requires learning algorithms to forget the past.

Strengths

This paper is very cool. Last-iterate convergence is an important problem for learning in games, and this paper motivates it naturally via the existing theoretical and empirical gap between OMWU and OGDA. To tackle this problem, the paper presents a particularly nontrivial analysis of the learning dynamics of Optimistic FTRL in games, despite the apparent simplicity of the constructed example. The takeaway message on the importance of forgetting is elegant and might inspire future work on algorithm design. I also appreciate the graphical illustration of the learning dynamics, which makes the highly technical proof easier to understand intuitively. Related works seem to be thoroughly discussed.

Frankly speaking, the analysis in this paper is quite involved, so I only have a high-level understanding of the proof while struggling with the details. However, as far as I can see, the high-level intuition makes sense, and the numerical experiments are quite helpful.

Weaknesses

This paper is already strong in my opinion, so I don't have any major complaints. One possible aspect to improve is the presentation. There are certain parts of the analysis that are particularly involved, so it might be better to write less quantitatively in the main paper and leave those parts to the appendix, for example, Assumption 2 and the discussion right after it. I would also appreciate a further simplified proof sketch; specifically, more words and less math.

Questions

Some minor comments:

  • In line 51-51 it is quoted from (van Erven, 2021) that FTRL is conventionally believed to be better than OMD. From the perspective of online convex optimization, I feel this is actually a bit more nuanced, as the examples I'm aware of (where FTRL beats OMD) concern time-varying learning rates, which differ from the constant learning rates considered in this paper.

  • The wording of Theorem 2 could be improved.

Limitations

Limitations are adequately addressed.

Author Response

Thank you for the positive review and the constructive comments on the presentation. We will incorporate your suggestions in the revised version of the paper. Specifically, we will update the flow of Section 3 and make Assumption 2, the discussion, and the proof sketch more reader-friendly. Below, we answer your questions.

Q1: In line 51-51 it is quoted from (van Erven, 2021) that FTRL is conventionally believed to be better than OMD. From the perspective of online convex optimization, I feel this is actually a bit more nuanced, as the examples I'm aware of (where FTRL beats OMD) concern time-varying learning rates, which differ from the constant learning rates considered in this paper.

A: Thank you for raising this point! We will add a remark on this claim in the revised paper. We conjecture that the slow last-iterate convergence of OFTRL algorithms persists even for time-varying learning rates; in particular, we conjecture that our hard example still works for decreasing step sizes. In the rebuttal PDF, we provide numerical results for the AdaGrad step size, an adaptively decreasing step-size schedule. The results show that the last iterate of the decreasing step-size version exhibits qualitatively similar behavior to its constant step-size counterpart. However, a time-dependent learning rate that occasionally increases the step size, or other periodic learning-rate schedules, might be able to address the issue of non-forgetfulness. We will discuss time-varying step sizes in the next version of the paper and leave how they affect the last-iterate convergence rates of OFTRL algorithms as an interesting and important open question.

Q2: The wording of Theorem 2 could be improved.

A: Thank you for the comment! We accidentally omitted the phrase "there is no function $f$ such that" before "... the corresponding learning dynamics" in the statement. We will improve the wording in the revised version.

Comment

Thank you for answering my questions.

Review (Rating: 6)

The paper investigates the last-iterate convergence properties of several classes of algorithms employed in zero-sum matrix games. As a core contribution, it is shown that a large class of iteration rules cannot exhibit instance-independent convergence in zero-sum games of dimension 2. The result is further generalized to larger dimensions.

Strengths

The paper tackles an important problem and makes a potentially significant contribution to our understanding of learning dynamics in zero-sum matrix games.

I think the results are rigorously developed, and while I did not verify the details of all the proofs, they seem to be correct.

Weaknesses

The concept of a "forgetful algorithm" could be made more explicit, perhaps as a definition. It is not clear to me what property makes an algorithm forgetful by construction. An explicit definition (if it is possible at all) could greatly clarify the work.

As a possible weakness, the stated negative results apply to a particular class of algorithms (OFTRL) and potentially might be sidestepped.

The formulation of the main theorems could be clarified or perhaps simplified. For instance, Theorem 2 has a rather complicated statement, which to my understanding just means either that the duality gap cannot go to zero (in which case it can be stated as such directly), or that the duality gap cannot go to zero uniformly (in which case item 1 should include the phrase "over all possible loss matrices").

I think one weakness of the paper is the presentation/organization. I found it difficult to follow the main ideas and the main contribution of the paper. For instance, the generalization to higher dimensions, while interesting, seems to be a minor contribution from a theoretical perspective, while significant space is dedicated to its proof ideas.

Questions

Is it possible to extract a "forgetfulness" property for an arbitrary learning algorithm that must imply the algorithm has slow last-iterate convergence?

Are there significant differences between Theorems 1 and 2 to warrant stating them separately? Upon first read, it was difficult to understand what the main difference was, and Theorem 2 seems to be a corollary of Theorem 1 looking at the proof in the appendix.

Limitations

See weaknesses.

Author Response

Thank you for acknowledging that we made a significant contribution to an important question and for the constructive comments on the presentation. We will incorporate your suggestions and improve the presentation in the next version of the paper. Below, we address your questions.

Q1: Is it possible to extract a "forgetfulness" property for an arbitrary learning algorithm that must imply the algorithm has slow last-iterate convergence?

A: Our paper aims to understand the slow last-iterate convergence of OMWU and, more generally, OFTRL-type and certain Optimistic OMD-type algorithms. Our proof and hard game instances build on the intuition that these algorithms lack forgetfulness: they do not forget the past quickly and play actions that performed well in the past, even if their recent performance is bad (see also Lines 70-99 in the Introduction). We are the first to connect the intuitive idea of forgetfulness to last-iterate convergence. Even without a formal definition of “forgetfulness”, we believe that this intuitive connection with last-iterate convergence is a novel contribution.

Pinning down a formal, precise definition of forgetfulness appears to be challenging. We did try, but found it rather difficult. For example, our first attempt was to define it as any algorithm that is "mean-based", that is, the decision at each time depends on the past gradients solely through their mean. Such algorithms give equal weight to all past observations and are thus intuitively not forgetful. However, we quickly realized that while this covers all FTRL algorithms, it does not even cover their optimistic versions (OFTRL, which is our main focus). We hope this convinces the reviewer that finding a formal and general definition for the lack of forgetfulness could be non-trivial. At any rate, the "lack of forgetfulness" is just an intuitive explanation for the phenomenon. We view our most important (and surprising) result to be the fact that all known OFTRL variants have these convergence issues. It would be interesting to come up with a formal definition that captures a broader class of algorithms, though that seems difficult, and we leave it as future work.
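To illustrate the distinction in standard notation (textbook forms of the updates, not the paper's exact definitions): with a fixed step size $\eta$ and regularizer $R$, FTRL plays $x_{t+1} = \operatorname{argmin}_{x \in \Delta} \{ \eta \langle \sum_{s=1}^{t} g_s, x \rangle + R(x) \}$, which depends on the past gradients only through their sum (equivalently, given $t$, through their mean), whereas OFTRL plays $x_{t+1} = \operatorname{argmin}_{x \in \Delta} \{ \eta \langle \sum_{s=1}^{t} g_s + g_t, x \rangle + R(x) \}$, counting the most recent gradient $g_t$ twice, so the update is no longer mean-based.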

Q2: Are there significant differences between Theorems 1 and 2 to warrant stating them separately? Upon first read, it was difficult to understand what the main difference was, and Theorem 2 seems to be a corollary of Theorem 1 looking at the proof in the appendix.

A: We clarify that Theorem 2 is indeed a corollary of Theorem 1. Thank you for pointing it out; we will make it clear in the next version of the paper. Theorem 1 is a more technical theorem that states the OFTRL learning dynamics have a constant duality gap even after $1/\delta$ steps. Theorem 2 is a more high-level theorem that states OFTRL learning dynamics cannot have a last-iterate convergence rate that solely depends on the dimension and the time. Thus, game-parameter dependence is unavoidable for OFTRL algorithms, which provides a clearer comparison between OFTRL and OGDA since the latter has an $O\big(\frac{\mathrm{poly}(d_1, d_2)}{\sqrt{T}}\big)$ last-iterate convergence rate. In summary, while Theorem 2 is a direct corollary of Theorem 1, we state Theorem 2 separately to highlight our main contribution and the difference between OFTRL and OGDA.
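In symbols, Theorem 2 as described above has the following structure (our paraphrase, with the quantifier that the authors note was accidentally omitted restored): there is no function $f$ with $\lim_{T \to \infty} f(d_1, d_2, T) = 0$ such that the OFTRL last iterate satisfies $\mathrm{DualityGap}(x_T, y_T) \le f(d_1, d_2, T)$ for all $T$ and all (bounded) $d_1 \times d_2$ game matrices. OGDA, by contrast, does admit such an $f$, namely $f(d_1, d_2, T) = O(\mathrm{poly}(d_1, d_2)/\sqrt{T})$.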

C1: As a possible weakness, the stated negative results apply to a particular class of algorithms (OFTRL) and potentially might be sidestepped.

A: Although our results only hold for OFTRL, we believe they are still significant given that OFTRL is arguably one of the most studied classes of online learning algorithms and covers many standard algorithms, including the classical OMWU. We also want to emphasize that the existing positive results are even more restrictive: only OGDA and OMWU have proven concrete last-iterate convergence rates. Inspired by OMWU’s weaker convergence guarantees compared to OGDA, we demonstrate that this disadvantage is unavoidable and, in fact, applies to a broader class of OFTRL algorithms.

Comment

I thank the authors for the detailed responses. After reading the other reviews and the rebuttal, I understand that the main point of the work is analyzing the OFTRL class of algorithms. I agree with the other reviewers that the work merits a better score and have raised mine.

Review (Rating: 7)

This paper shows a limitation of optimistic multiplicative weights (OMWU) with a fixed learning rate: its last-iterate convergence rate depends on a game-dependent constant that can be arbitrarily large in normal-form games. This demonstrates that the current upper bounds on convergence are not loose and that there is a real barrier for these dynamics.

Broader Context: Optimistic online learning algorithms play an important role in solving games. The average iterates of OMWU have been shown to converge in zero-sum games at a rate of $O(1/T)$, which is better than the non-optimistic counterpart. Another important algorithm in this context is optimistic gradient descent-ascent (OGDA), which has been shown to have last-iterate convergence of $O(1/\sqrt{T})$, where the constant hides a polynomial dependence on the problem parameters. Last-iterate convergence is a desirable property in the context of games. Since OMWU is a central tool for solving games, understanding its limitations is a fundamental question.

This paper reveals an inherent limitation of OMWU (and more generally, for some algorithms within the optimistic FTRL framework), showing that OGDA might sometimes enjoy better last-iterate convergence. The main property responsible for this limitation is the "lack of forgetfulness."

Strengths

The paper answers a fundamental question in the "learning in games" community (solving games using online learning dynamics). I also like the presentation of the paper. For example, the authors made an effort to explain the intuition behind the construction of the hard game.

Weaknesses

My only criticism is the following: it might be the case that OMWU can have good last-iterate convergence when using changing learning rates, which are sometimes used to avoid such forgetfulness issues.

(But I believe that analyzing a fixed learning rate is a significant contribution on its own.)

Questions

Do you think using time-dependent learning rates might improve the last-iterate convergence of OMWU? If so, maybe it's worth discussing it or writing it as a limitation.

Limitations

See the above.

Author Response

Thank you for your encouraging comments!

Q: Do you think using time-dependent learning rates might improve the last-iterate convergence of OMWU? If so, maybe it's worth discussing it or writing it as a limitation.

A: We conjecture that the slow last-iterate convergence of OMWU persists for time-dependent step sizes (learning rates); in particular, we conjecture that our hard example still works for decreasing step sizes. In the rebuttal PDF, we provide numerical results for the AdaGrad step size, an adaptively decreasing step-size schedule. The results show that the last iterate of the decreasing step-size version exhibits qualitatively similar behavior to its constant step-size counterpart. However, a time-dependent learning rate that occasionally increases the step size, or other periodic learning-rate schedules, might be able to address the issue of non-forgetfulness. We will discuss time-varying step sizes in the revised version of the paper and leave how they affect the last-iterate convergence rates of OFTRL algorithms as an interesting and important open question.

Comment

Thanks for your great response. I will keep my positive score.

Review (Rating: 7)

Through some hard game instances, the authors study the fundamental differences in convergence between two broad classes of algorithms, OOMD and OFTRL. Particular algorithms of interest include the OGDA and OMWU algorithms, respectively. They show that OFTRL (in particular, OMWU) necessarily exhibits slow last-iterate convergence, which leads to an undesirable dependence on the underlying game.

Strengths

This paper seems to capture an important property that differentiates algorithms such as OGDA and OMWU, two extremely popular optimistic algorithms known to achieve last-iterate convergence in zero-sum games. This paper mounts further evidence of the "slowness" of OMWU, showing that it is unavoidable to have game-dependent constants appearing in the guarantees! I think this is an important finding for this area.

Weaknesses

This is a focused study showing certain properties of classes of algorithms such as OFTRL and OOMD. In general, it is not clear whether some of the results apply to every regularizer satisfying the assumptions or only to the particular ones of interest, such as the Euclidean and entropic regularizers. Beyond a few clarifications, which I have asked in the next section, I do not think there are major weaknesses.

Questions

  1. Please clarify whether Assumptions 1 and 2 are satisfied by the specific regularizers such as entropic, Euclidean, and log (their standard forms are recalled right after this list), or by general regularizers that are 1-strongly convex with respect to the $\ell_2$ norm.
  2. Does the hard example still hold with vanishing step sizes?
  3. Is the high-dimensional example applicable only for the specific regularizers mentioned?
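For reference, the standard simplex forms of the regularizers named in item 1, as conventionally defined in the online learning literature (the paper's exact normalizations may differ): negative entropy $R(x) = \sum_i x[i] \ln x[i]$, Euclidean $R(x) = \frac{1}{2}\|x\|_2^2$, and log barrier $R(x) = -\sum_i \ln x[i]$.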

Limitations

Yes.

Author Response

Thank you for the positive review! Below, we address your questions.

Q1: Please clarify whether Assumptions 1 and 2 are satisfied by the specific regularizers such as entropic, Euclidean, and log, or by general regularizers that are 1-strongly convex with respect to the $\ell_2$ norm.

A: We clarify that we only verify Assumptions 1 and 2 for specific regularizers, not for all regularizers that are 1-strongly convex with respect to the $\ell_2$ norm. In the current manuscript, we verify the two assumptions for the negative entropy, Euclidean, and log regularizers. We can also show that the two assumptions hold for the family of (negative) Tsallis entropy regularizers, parameterized by $0 < \beta < 1$: $R(x) = \frac{1 - \sum_i x[i]^{\beta}}{1-\beta}$. Together with the entropy, Euclidean, and log regularizers, we believe our results cover every commonly used regularizer in the online learning literature. We will add the proof for Tsallis entropy regularizers in the revised version of the paper.
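As a sanity check on how this family relates to the entropy regularizer (our own derivation, not a claim quoted from the response above): for $x$ in the simplex, both the numerator and the denominator vanish at $\beta = 1$, and by L'Hôpital's rule $\lim_{\beta \to 1} \frac{1 - \sum_i x[i]^{\beta}}{1 - \beta} = \lim_{\beta \to 1} \sum_i x[i]^{\beta} \ln x[i] = \sum_i x[i] \ln x[i]$, so the Tsallis family recovers the negative entropy regularizer as $\beta \to 1$.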

Q2: Does the hard example still hold with vanishing step sizes?

A: This is a great question! Our current proof only works for fixed step sizes. In the rebuttal PDF, we provide numerical results for the AdaGrad step size, an adaptively decreasing step-size schedule. The results show that the last iterate of the decreasing step-size version exhibits qualitatively similar behavior to its constant step-size counterpart. We conjecture that our results could be extended to the case of decreasing step sizes. We will add a discussion and leave whether OFTRL with vanishing/variable step sizes could have a fast last-iterate convergence rate as an important future direction.

Q3: Is the high-dimensional example applicable only for the specific regularizers mentioned?

A: The reduction in the high-dimensional setting relies on the structure of the regularizers and might not hold for all 1-strongly convex regularizers. As shown in the current paper, the reduction works for the negative entropy, Euclidean, and log regularizers. We can also prove that the reduction works for the family of Tsallis entropy regularizers (see the answer to point 1 above). We will add this result in the revised version of the paper.

Comment

I thank the authors for the clarifications and the experiments with vanishing step sizes. I will maintain my already positive score.

Author Response

Discussion on Dynamic Step Sizes: Our negative results hold for OFTRL with fixed step sizes. We conjecture that the slow last-iterate convergence of OFTRL persists even with dynamic (time-varying) step sizes. In particular, we believe our counterexamples still work for OFTRL with decreasing step sizes: decreasing the step size makes the players move even more slowly, so they may be trapped in the wrong direction for a longer time due to the lack of forgetfulness. In the PDF, we include numerical results for OFTRL with an adaptive step size akin to AdaGrad [1], which support our intuition: the duality gap of the last iterate exhibits qualitatively similar behavior to its fixed step-size counterparts. Investigating the effect of dynamic step sizes on last-iterate convergence rates is an interesting future direction.

[1] Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research, 2011.
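Since the rebuttal PDF is not reproduced here, the following sketch only illustrates the kind of experiment described above: OMWU (entropy-regularized OFTRL) with an AdaGrad-like decreasing step size. The schedule $\eta_t = \eta_0 / \sqrt{1 + \sum_s \|g_s\|^2}$ and all parameters are our assumptions, not the authors' exact setup.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def duality_gap(A, x, y):
    return (x @ A).max() - (A @ y).min()

def run_omwu_adagrad(A, eta0=1.0, T=100_000):
    # OMWU (OFTRL with entropy) with an AdaGrad-style decreasing step size
    # (schedule assumed, not taken from the rebuttal PDF).
    d1, d2 = A.shape
    Lx, Ly = np.zeros(d1), np.zeros(d2)      # cumulative loss vectors
    gx, gy = np.zeros(d1), np.zeros(d2)      # last losses = optimistic prediction
    sq = 0.0                                 # running sum of squared gradient norms
    for _ in range(T):
        eta = eta0 / np.sqrt(1.0 + sq)       # decreasing, AdaGrad-like
        x, y = softmax(-eta * (Lx + gx)), softmax(-eta * (Ly + gy))
        gx, gy = A @ y, -(A.T @ x)           # x minimizes, y maximizes x^T A y
        sq += gx @ gx + gy @ gy
        Lx += gx
        Ly += gy
    return duality_gap(A, x, y)

A = np.array([[1.0, -1.0], [-1.0, 1.0]])     # placeholder 2x2 game
print("OMWU + AdaGrad-like step size, last-iterate gap:", run_omwu_adagrad(A))
```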

Final Decision

A clear accept paper. Identifying OMWU's inherent weakness is an interesting finding, and identifying the root cause -- it does not forget quickly enough -- is a plus on top.