7.8

/10

Poster4 位审稿人

最低4最高5标准差0.4

3.3

置信度

创新性2.8

质量3.3

清晰度3.5

重要性3.0

NeurIPS 2025

From Average-Iterate to Last-Iterate Convergence in Games: A Reduction and Its Applications

Yang Cai,Haipeng Luo,Chen-Yu Wei,Weiqiang Zheng

OpenReview PDF

提交: 2025-05-11更新: 2025-10-29

摘要

The convergence of online learning algorithms in games under self-play is a fundamental question in game theory and machine learning. Among various notions of convergence, last-iterate convergence is particularly desirable, as it reflects the actual decisions made by the learners and captures the day-to-day behavior of the learning dynamics. While many algorithms are known to converge in the average-iterate, achieving last-iterate convergence typically requires considerably more effort in both the design and the analysis of the algorithm. Somewhat surprisingly, we show in this paper that for a large family of games, there exists a simple black-box reduction that transforms the average iterates of an uncoupled learning dynamics into the last iterates of a new uncoupled learning dynamics, thus also providing a reduction from last-iterate convergence to average-iterate convergence. Our reduction applies to games where each player’s utility is linear in both their own strategy and the joint strategy of all opponents. This family includes two-player bimatrix games and generalizations such as multi-player polymatrix games. By applying our reduction to the Optimistic Multiplicative Weights Update algorithm, we obtain new state-of-the-art last-iterate convergence rates for uncoupled learning dynamics in multi-player zero-sum polymatrix games: (1) an $O(\frac{\log d}{T})$ last-iterate convergence rate under gradient feedback, representing an exponential improvement in the dependence on the dimension $d$ (i.e., the maximum number of actions available to either player); and (2) an $\tilde{O}(d^{\frac{1}{5}}T^{-\frac{1}{5}})$ last-iterate convergence rate under bandit feedback, improving upon the previous best rates of $\tilde{O}(\sqrt{d}T^{-\frac{1}{8}})$ and $\tilde{O}(\sqrt{d}T^{-\frac{1}{6}})$.

关键词

Last-Iterate ConvergenceLearning in GamesZero-Sum GamesBandit FeedbackOnline Learning

评审与讨论

审稿意见

评分: 5置信度: 32025-07-02

This paper studies the convergence of online learning algorithms in games, focusing on the theoretical and practically importance setting, the last-iterate convergence property of algorithms used in a large class of games. The technique of reduction from last iterates to average-iterate seems novel and significant. The results and techniques also implies generalizations in multi-player setting.

优缺点分析

Strength

The paper is well written and easy to follow. I think the authors have done a great effort trying to make the paper accessable to readers who are not experts in the narrow area.

The contributions of this paper on the improvement of convergence rates are clear and comparison to related results are presented.

The importance of reduction is the improvement for uncoupled learning in multi-player zero-sum polymatrix games. This reduction technique enables authors to obtain a last-iterate convergence rate for OMWU that matches its average-iterate rate under the same setting, say the gradient feedback. The bandit feedback is analyzed based on the same reduction technique, and this paper propose a OMWU-bandit algorithm that achieves a SOTA convergence rate compared to existing results.

Weakness

I think a general reader might be interested in the reduction technique used in this paper. If the authors could have provided a pointer to literature or open a section to discuss the background, the current layout of Section 3 mainly focuses on the technical side of this paper, and the extensive study mentioned section 4 refers to Opmtimistic learning algorithms. Since the results of this paper already say for the power of this key technique, a detailed introduction to its idea based on related works will be more helpful to curious readers.

问题

I am not totally sure whether the reduction technique is an innovation compared to the literature. What are other related papers that might be considered inspiring to this approach?

局限性

Yes, the limitation of the results and method is clearly stated in the paper.

最终评判理由

I have no futher concerns after rebuttal.

格式问题

No formatting issues.

作者回复

2025-07-31

We appreciate that the reviewer finds our paper well-written and easy to follow, and our theoretical contribution clear and important. We address the minor comment below.

I think a general reader might be interested in the reduction technique used in this paper. If the authors could have provided a pointer to literature or open a section to discuss the background, the current layout of Section 3 mainly focuses on the technical side of this paper, and the extensive study mentioned section 4 refers to Opmtimistic learning algorithms. Since the results of this paper already say for the power of this key technique, a detailed introduction to its idea based on related works will be more helpful to curious readers.

I am not totally sure whether the reduction technique is an innovation compared to the literature. What are other related papers that might be considered inspiring to this approach?

A: We thank the reviewer for the comment. While we were not aware of prior works that studied similar reductions when writing the paper, Reviewer Qfxk pointed out that Cutkosky (2019) is quite related to our work. The problem Cutkosky (2019) studied is the online-to-batch conversion, which solves an offline optimization problem using an online learning algorithm. While our A2L reduction and the anytime online-to-batch (AO2B) reduction share the idea of playing the average of previous iterates, they differ significantly in several key aspects:

Problem setting: AO2B is designed for static optimization, where the loss function remains fixed across rounds. A2L, on the other hand, applies to a dynamic multi-agent game setting, where each player’s loss function can change over time due to the actions of others.
Objective: AO2B aims to solve a single offline optimization problem with improved convergence rates. AO2B does not guarantee equivalence between the last iterate of the new learning algorithm and the average iterate of the original algorithm (since the gradient feedbacks are different). In contrast, A2L aims to show that the last iterate of the new dynamics corresponds to the averaged iterate of the original dynamics.
Use of gradient: In AO2B, the learner updates directly using the gradient at the average iterate $\bar{x}^t = \frac{1}{t} \sum_{k=1}^t x^k$ . In contrast, A2L requires post-processing the gradient feedback to recover the true gradient at the original iterate $x^t$ , as shown in Line 4 of our algorithm.

We will include this comparison with Cutkosky (2019) in the revised version of the paper and elaborate more on the A2L reduction to make it more accessible to general readers. Thanks again for the suggestion!

Cutkosky, Ashok. "Anytime online-to-batch, optimism and acceleration." International conference on machine learning. PMLR, 2019.

2025-08-09

Thanks for your reply. I have no other concerns and will keep my positive evaluation for now.

审稿意见

评分: 5置信度: 32025-07-02

The paper studies last-iterate-convergence methods for solving normal-form games. Instead of proposing a new method from scratch, it introduces a black-box reduction (A2L) that converts any online learning algorithm with average-iterate convergence into one that achieves last-iterate convergence, under the assumption of linear utilities. The authors apply the reduction to OMWU, and derive state-of-the-art convergence guarantees under gradient feedback and bandit feedback.

优缺点分析

Strengths

The paper is in general clear and well organized. Its significance and novelty over prior work are effectively communicated, with both the high-level intuition and technical contributions presented at appropriate points throughout the paper.

The reduction method is simple and intuitive. Though highly dependent on the linear utility assumption, the method still covers a number of important game classes. Clear theoretical benefits -- faster convergence rates, uncoupled learning dynamics, last-iterate convergence -- are shown in solving zero-sum games under gradient and bandit feedback, without requiring major algorithmic changes .

Weakness

The authors explained the merit of last-iterate convergence a few times in the paper, but I am not fully convinced. From the perspectives of practical implementation, tracking the average iterates is simple to carry out and does not introduce any additional time or memory complexity. I wonder if there are more concrete practical benefits from last-iterate-convergent algorithms that can help provide a better motivation.

Assumption 1 is another limitation, and I cannot see how any of the arguments in the paper can be applied/extended when the utility is nonlinear. However, this does not mean the work is unimportant. The normal-form game framework with linear utilities captures several fundamental and widely studied game classes, such as two-player bimatrix games and multi-player polymatrix games, including their zero-sum variants. Just being able to solve these games alone is a contribution.

问题

I would be interested to see how A2L-OMWU compares with other EG/OG algorithms known to have last-iterate convergence that the authors referenced in Related Works (line 102 and below). The purpose is to see whether A2L-OMWU truly converges faster both in theory and in practice, or whether the other algorithms also perform well in practice, with their current inferior convergence rates being an artifact of analysis.
Can any of the argument be generalized to Markov games (with linear reward function)?
A few highly relevant references should be included and discussed that design last-iterate-convergent algorithms for zero-sum games based on perturbation/regularization, including Abe et al., 2022, Abe et al., 2023, Wu et al., 2023, Zeng et al., 2023.

References

Abe, K., Sakamoto, M., & Iwasaki, A. (2022, August). Mutation-driven follow the regularized leader for last-iterate convergence in zero-sum games. In Uncertainty in Artificial Intelligence (pp. 1-10). PMLR.

Abe, K., Ariu, K., Sakamoto, M., Toyoshima, K., & Iwasaki, A. (2023). Last-Iterate Convergence with Full and Noisy Feedback in Two-Player Zero-Sum Games. In AISTATS 2023, International Conference on Artificial Intelligence and Statistics, 25-27 April 2023, Palau de Congressos, Valencia, Spain (Vol. 206, pp. 7999-8028). MLResearchPress.

Wu, J., Barakat, A., Fatkhullin, I., & He, N. (2023). Learning Zero-Sum Linear Quadratic Games with Improved Sample Complexity and Last-Iterate Convergence. arXiv preprint arXiv:2309.04272.

Zeng, S., Doan, T., & Romberg, J. (2022). Regularized gradient descent ascent for two-player zero-sum Markov games. Advances in Neural Information Processing Systems, 35, 34546-34558.

局限性

N/A

最终评判理由

I do not think I missed any important contributions or weaknesses in my original evaluation and therefore would like to keep my rating unchanged.

格式问题

No concerns

作者回复

2025-07-31

We appreciate that reviewer h8Tb find our paper clear and well organized, our results significant and novel over prior work, and our reduction simple, intuitive, and has clear theoretical benefits. We address below the questions and comments.

I would be interested to see how A2L-OMWU compares with other EG/OG algorithms known to have last-iterate convergence that the authors referenced in Related Works (line 102 and below). The purpose is to see whether A2L-OMWU truly converges faster both in theory and in practice, or whether the other algorithms also perform well in practice, with their current inferior convergence rates being an artifact of analysis.

Thank you for the suggestion! We will include numerical experiments comparing A2L-OMWU with EG and OG in the revised version. We would like to highlight that A2L-OMWU enjoys logarithmic dependence on the dimension (i.e., the number of actions), whereas EG and OG suffer polynomial dependence. This polynomial dependence is likely unavoidable (that is, not an artifact of the analysis) for EG/OG due to their reliance on Euclidean geometry, as discussed in [1, 2]. Consequently, A2L-OMWU may be particularly preferred in high-dimensional settings.

[1] Levy, Daniel, and John C. Duchi. "Necessary and sufficient geometries for gradient methods." Advances in Neural Information Processing Systems 32 (2019).

[2] Cheng, Chen, Daniel Levy, and John C. Duchi. "Geometry, Computation, and Optimality in Stochastic Optimization." arXiv 2025

Can any of the argument be generalized to Markov games (with linear reward function)?

We believe the core ideas behind our A2L reduction can be extended to Markov games, where normal-form games serve as a stateless special case. The main technical challenge lies in the state transitions: unlike in normal-form games, the utility of an action in a Markov game is governed by the Q-function, which incorporates both immediate and discounted future rewards and depends on evolving policies. This dynamic structure introduces significant complexity and requires a non-trivial extension of our current reduction. We view the generalization of A2L to Markov games as an important direction for future work.

A few highly relevant references should be included and discussed that design last-iterate-convergent algorithms for zero-sum games based on perturbation/regularization, including Abe et al., 2022, Abe et al., 2023, Wu et al., 2023, Zeng et al., 2023.

Thank you for pointing out these related works! We will include and discuss them in the revised version of the paper.

In particular, [Abe et al., 2022, 2023] and [Wu et al., 2023] design algorithms based on perturbation and show last-iterate convergence to a stationary point near a Nash equilibrium in both gradient and noisy gradient feedback. [Abe et al., 2023] also show asymptotic convergence to an exact Nash equilibrium by adptively adjusting the perturbation. [Zeng et al., 2022] studies two-player zero-sum Markov games using the entropy-regularization approach and proposes a gradient-descent-ascent algorithm with a $O(T\^{-1/3})$ last-iterate convergence. Their result requires an additional assumption (Assumption 2 in their paper) on the regularized Nash equilibrium that each entry (the probability of playing an action) has a uniform lower bound independent of the regularization strength.

Abe, K., Sakamoto, M., & Iwasaki, A. (2022, August). Mutation-driven follow the regularized leader for last-iterate convergence in zero-sum games. In Uncertainty in Artificial Intelligence (pp. 1-10). PMLR.
Abe, K., Ariu, K., Sakamoto, M., Toyoshima, K., & Iwasaki, A. (2023). Last-Iterate Convergence with Full and Noisy Feedback in Two-Player Zero-Sum Games. In AISTATS 2023, International Conference on Artificial Intelligence and Statistics, 25-27 April 2023, Palau de Congressos, Valencia, Spain (Vol. 206, pp. 7999-8028). MLResearchPress.
Wu, J., Barakat, A., Fatkhullin, I., & He, N. (2023). Learning Zero-Sum Linear Quadratic Games with Improved Sample Complexity and Last-Iterate Convergence. arXiv preprint arXiv:2309.04272.
Zeng, S., Doan, T., & Romberg, J. (2022). Regularized gradient descent ascent for two-player zero-sum Markov games. Advances in Neural Information Processing Systems, 35, 34546-34558.

The authors explained the merit of last-iterate convergence a few times in the paper, but I am not fully convinced. From the perspectives of practical implementation, tracking the average iterates is simple to carry out and does not introduce any additional time or memory complexity. I wonder if there are more concrete practical benefits from last-iterate-convergent algorithms that can help provide a better motivation.

Response: Thanks for the comment. We would like to highlight that learning dynamics with last-iterate convergence offer important advantages over divergent dynamics that only guarantee convergence in the average. First, last-iterate convergence ensures that each player achieves stronger sublinear dynamic regret bounds [See e.g., [Cai-Zheng’23 Theorem 3] and our Corollary 1], whereas divergent dynamics typically only guarantee sublinear external regret and can incur linear dynamic regret. Another advantage of last-iterate convergence is that it characterizes the day-to-day behavior of the learning dynamics and enables robust predictions on players’ behavior, which is not the case for divergent dynamics.

Cai, Yang, and Weiqiang Zheng. "Doubly optimal no-regret learning in monotone games." International Conference on Machine Learning. PMLR, 2023.

Assumption 1 is another limitation, and I cannot see how any of the arguments in the paper can be applied/extended when the utility is nonlinear. However, this does not mean the work is unimportant. The normal-form game framework with linear utilities captures several fundamental and widely studied game classes, such as two-player bimatrix games and multi-player polymatrix games, including their zero-sum variants. Just being able to solve these games alone is a contribution.

Response: We thank the encouraging comment on our contribution. We recently found that the A2L reduction can apply to other structured settings with possibly nonlinear utilities. We will first recall the core ideas of A2L reduction and then give one example of learning dynamics for market equilibrium.

The core idea of the A2L reduction lies in the “extraction” step: when players play the average of past iterates and observe corresponding feedback, they recover the feedback they would have received had they all played the original iterate. In our current framework, this is enabled by linearity, but similar structure-based recoverability can exist beyond linear settings.

One example is the proportional response dynamics (PRD) for market equilibrium in Fisher markets [Cheung et al., 2025] (where utilities can be nonlinear). In a fisher market, there are $n$ agents and $m$ goods. Each agent $i$ has a budget $e\_i$ and each good $j$ has a unit supply. The proportional response dynamic works as follows:

Budget Spending: each player $i$ decides the allocation of budget $\{b^t\_{ij} \}$ where $b^t_{ij}$ is his spending on good $j$ . It must hold that $\sum_j b^t_{ij} = e_i$ ;
Goods allocation: the amount of good $j$ allocated to agent $i$ is proportional to their spending as $x^t_{ij} = b^t_{ij} / \sum_{i} b^t_{i, j}$ . Moreover, the price of each good $j$ is defined as the sum of agents’ spending $p^t_j:= \sum_{i} b^t_{ij}$ .
Update: Based on the allocation $x^t_{ij}$ , each buyer then updates their spending of budget $\{ b^{t+1}_{ij} \}$ in the next round using certain rules.

The original PRD (for GS utility) has a $O(1/T)$ convergence rate only in terms of the average price, but not the last iterate price. However, by applying A2L reduction— let each agent spend their average budget allocation $\bar{b}^t_i = 1/t \sum_k b^k_i$ and observe the induced allocation $\bar{x}^t$ ---we can recover the original iterate's allocation due to the proportional structure. Thus, we get a modified PRD with $O(1/T)$ last-iterate convergence in price. Here, the key observation is that the feedback in PRD is the allocation $x$ , which has a nice structure despite each player having nonlinear utilities.

This example illustrates that our reduction can extend to other settings where feedback remains recoverable from averaged plays, even without linearity on the utility. We will include this example in the revised version. Identifying structures that enable A2L reduction in more general settings remains an interesting future direction.

Cheung, Yun Kuen, Richard Cole, and Yixin Tao. "Proportional Response Dynamics in Gross Substitutes Markets." Proceedings of the 26th ACM Conference on Economics and Computation. 2025.

2025-08-01

I thank the authors for the detailed response, especially the discussion on how A2L reduction may be applied to nonlinear utilities. I do not think I missed any important contributions or weaknesses in my original evaluation and therefore would like to keep my rating unchanged.

审稿意见

评分: 5置信度: 32025-07-03

average iterate convergence rates. There has been increasing interest in deriving last-iterate convergence guarantees for equilibrium learning algorithms. Such guarantees are more practically useful, but also harder to obtain.

The main contribution of the paper is an algorithm that takes an algorithm with an average-iterate convergence guarantee, and turns it into an algorithm with a last iterate convergence guarantee. The construction assumes that the utility function is linear in their own strategy. They consider both settings with gradient feedback and bandit feedback.

The authors argue that in some cases, this leads to state of the art results. By applying the approach to the OMWA algorithm for zero-sum polymatrix games, this leads to state of the art last-iterate convergence guarantees.

优缺点分析

This is an interesting result worthy of publishing. In particular, it may provide an easier route to last-iterate guarantees. It is an open question how practically useful the approach it.

The paper is well presented, and I only have some minor suggested improvements.

The proofs for the gradient-based setting are relatively short and straightforward, while the bandit based settings are more involved. I have read the proofs and only have a few remarks.

问题

Minor remarks

Second inequality after 585 (proof of Theorem 3): Comparing with Lemma 2, there seems to be a (n-1) factor missing. Do you need to assume smaller eta?

Line 598: Please give some more explanation for the expression for delta^t.

It would have been nice to see the theoretical results complemented with numerical experiments, e.g., comparing OMWU with A2L-OMWU on some interesting games.

局限性

Yes

最终评判理由

I am satisfied with the authors' rebuttal to our reviews. This is a strong paper that deserves to be accepted.

格式问题

None

作者回复

2025-07-31

We appreciate that reviewer yhjx finds our results interesting and well presented. We address the minor remarks below.

Second inequality after 585 (proof of Theorem 3): Comparing with Lemma 2, there seems to be a (n-1) factor missing. Do you need to assume smaller eta?

A: Thank you very much for pointing out this typo. The correct version of the inequalities after 585 is

\\sum_{i=1}^n reg_i (T) \le \frac{\sum_{i=1}^n \log d_i}{\eta} + \sum_{i=1}^n \\sum_{t=1}^T ( \eta ||u^t\_i - u^{t-1}\_i||^2_{\infty} - \frac{1}{4\eta} ||x^t\_i -x^{t-1}\_i||_1^2 ) \\le \frac{\sum\_{i=1}^n \log d\_i}{\eta},

where the final inequality follows from Lemma 2 and the condition $\eta \le \frac{1}{2(n-1)}$ . With this correction, there is no need to impose a stricter constraint on $\eta$ . We will fix this in the revised version. Thank you again for your careful reading.

Line 598: Please give some more explanation for the expression for delta^t.

A: Thanks for the comment. In the A2L reduction with gradient feedback, we get gradient feedback $\bar{u}^t:= u\_i(\cdot, \bar{x}\_{-i}^t) = \frac{1}{t} \sum\_{k=1}^t u\_i(\cdot, x\_{-i}^k)$ and then recover the true gradient $u^t:= u_i(\cdot, x^t_{-i})$ via the identity $u^t = t \cdot \bar{u}^t - (t-1) \bar{u}^{t-1}$ , which follows from the linearity of the utility.

In the bandit setting, however, we do not observe $\bar{u}^t$ directly. Instead, we obtain an estimate $\hat{U}^t \approx \bar{u}^t$ (Line 6, Algorithm 3). Letting $\Delta^t := \hat{U}^t - \bar{u}^t$ denote the estimation error in this step. Then the approximation $\hat{u}^t := t \cdot \hat{U}^t - (t-1) \cdot \hat{U}^{t-1}$ (Line 7, Algorithm 3) differs from the true $u^t$ by

\delta^t = \hat{u}^t - u^t = t \cdot \hat{U}^t - (t-1)\cdot \hat{U}^{t-1} - t \cdot \bar{u}^t + (t-1) \cdot \bar{u}^{t-1} = t \Delta^t - (t-1) \Delta^{t-1}.

This explains the expression for $\delta^t$ . We will include this clarification in the revised version. Thank you again for your thoughtful question.

It would have been nice to see the theoretical results complemented with numerical experiments, e.g., comparing OMWU with A2L-OMWU on some interesting games.

A: Thanks for the suggestion! We will include numerical experiments on comparing OMWU and A2L-OMWU on the hard instance of [1]. As shown in [1], OMWU could have arbitrarily slow last-iterate convergence on these hard instances. But A2L-OMWU is equivalent to the average iterates of OMWU and would have a fast $O(1/T)$ last-iterate convergence rate. As also suggested by other reviewers, we will also include numerical experiments on comparing A2L-OMWU and EG/OG where the former is more favorable due to its logarithmic dependence on the number of actions.

[1] Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, and Weiqiang Zheng. “Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms”. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.

评论- Response to rebuttal

2025-08-07

Thank you for your detailed comments on my review. The authors have addressed my concerns well.

I am satisfied with the correction of the inequality. I find the clarification of the step in line 598 satisfactory. In overall, I believe the readability and accessibility of the paper would increase if the authors utilise some of the extra page in the camera ready version to spell out more details in the proofs. I am pleased to see that the authors will include some numerical experiments. As mentioned in my review, it is interesting to see whether the theoretical approach described would lead to practically useful algorithms.

审稿意见

评分: 4置信度: 42025-07-03

This paper proposes A2L, a black-box reduction that transforms any online learning algorithm with average-iterate convergence guarantees into one whose last iterates converge. The reduction applies to games with payoffs linear in players’ strategies, including zero-sum bimatrix games and multi-player polymatrix games. Furthermore, they show that applying A2L to Optimistic Multiplicative Weights Update improved last-iterate convergence rates and the fastest known convergence under bandit feedback.

优缺点分析

Strengths:

The black box reduction is simple but able to provide stronger last-iterate convergence rates for zero-sum games, i.e., they get a last-iterate rate of $O(\log d /T)$ , when they apply their black-box reduction to OMWU.
They also get improved rates for the gradient feedback and the bandit case for zero-sum games, where they show that the ``reduced'' versions of Bandit-OMWU achieve rates of $O(T^{-1/5})$ , compared to prior approaches that achieve O( $T^{-1/8}$ ).

Weakness:

They already state that their techniques are limited to games that satisfy the Linear utilities assumption (Assumption 1), such as bimatrix games and polymatrix games.
Although it is mainly a theory paper, it would be nice to see how the A2L-OMWU performs in practice against OMWU and OGDA for instance.

See the questions section as I would like to seek further clarification.

问题

The reduction reminds me of the paper by Cutkosky (2019), where they do anytime online to batch conversion. Can the authors comment on any similarities/differences with their approach. In general, I think it would be useful to cite it.
Following from the previous question Cutkosky (2019) have a weighted average version in their reduction and I was wondering, if your reduction can also have such weights embedded and whether that could be useful?
It is known that Cai et al. (2024) have this game defined as $A_{\delta}$ in their paper that shows the lower bound for OMWU. If I understand your results correctly, the A2L-OMWU, in its rates does not have a game dependent constant? If so this must be highlighted.
The reduction was shown through OMWU, but can it be applied to standard (non-optimistic) versions to get last-iterate convergence? Is it also applicable to general FTRL?
It would be nice to see some experiments on large-scale games to see the performance of A2L-OMWU, OMWU and OGDA in practice, with respect to number of steps and also the wall-clock time.
Could the authors also clarify the dependence on the other constants when compared to Cai et al. (2023)? in Theorem 4?

References:

Cutkosky, Ashok. "Anytime online-to-batch, optimism and acceleration." International conference on machine learning. PMLR, 2019.

Yang Cai, Haipeng Luo, Chen-Yu Wei, and Weiqiang Zheng. “Uncoupled and convergent learning in two-player zero-sum markov games with bandit feedback”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 36364–36406.

Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, and Weiqiang Zheng. “Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms”. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.

局限性

最终评判理由

While I appreciate the theoretical results, it would may be help to see experimental comparisons, especially on wall-clock times between the reductions and using the original algorithms, to give a better view of the landscape of these algorithms for such games. In general, I would like to maintain my original (positive) score of the paper as I think it would be good for the community to explore this idea beyond games with linear utilities.

格式问题

作者回复

2025-07-31

We thank the reviewer for appreciating the simplicity of our reduction and the strong convergence rate results. We also thank the reviewer for constructive comments and pointing a related work. We address the questions and comments below.

The reduction reminds me of the paper by Cutkosky (2019), where they do anytime online to batch conversion. Can the authors comment on any similarities/differences with their approach. In general, I think it would be useful to cite it.

A: Thank you for pointing us to the work of Cutkosky (2019). While our A2L reduction and the anytime online-to-batch (AO2B) reduction share the idea of playing the average of previous iterates, they differ significantly in several key aspects:

Problem setting: AO2B is designed for static optimization, where the loss function remains fixed across rounds. A2L, on the other hand, applies to a dynamic multi-agent game setting, where each player’s loss function can change over time due to the actions of others.
Objective: AO2B aims to solve a single offline optimization problem with improved convergence rates. AO2B does not guarantee equivalence between the last iterate of the new learning algorithm and the average iterate of the original algorithm (since the gradient feedbacks are different). In contrast, A2L aims to show that the last iterate of the new dynamics corresponds to the averaged iterate of the original dynamics.
Use of gradient: In AO2B, the learner updates directly using the gradient at the average iterate $\bar{x}^t = \frac{1}{t} \sum_{k=1}^t x^k$ . In contrast, A2L requires post-processing the gradient feedback to recover the true gradient at the original iterate $x^t$ , as shown in Line 4 of our Algorithm 1.

We will include this comparison with Cutkosky (2019) in the revised version of the paper. Thank you again for the helpful suggestion.

Following from the previous question Cutkosky (2019) have a weighted average version in their reduction and I was wondering, if your reduction can also have such weights embedded and whether that could be useful?

A: Thank you for the insightful question!

Our A2L reduction naturally extends to the weighted averaging setting, provided all players agree on the weights in advance and incorporate them into the reduction. In this version, each player plays the weighted average of their past iterates. Due to the linearity of utilities, the gradient feedback can still be used to recover the gradient corresponding to the original iterate, as in the current A2L reduction. As a result, the last iterate of the new dynamics corresponds to the weighted average in the original dynamics. The correctness follows the same steps as in Theorem 2.

This weighted version of A2L provides additional flexibility and can be beneficial in practice. For example, in regret matching algorithms—widely used for solving large-scale games like poker—linear averaging (i.e., giving the $k$ -th iterate weight proportional to $k$ ) often empirically outperforms uniform averaging, despite having the same theoretical rate. We will include and discuss this weighted variant in the revised version. Thank you again for the suggestion.

It is known that Cai et al. (2024) have this game defined as $A_\delta$ in their paper that shows the lower bound for OMWU. If I understand your results correctly, the A2L-OMWU, in its rates does not have a game dependent constant? If so this must be highlighted.

A: A2L-OMWU’s last-iterate convergence rate is uniform and does not have a game-dependent constant. Since our A2L reduction transforms average iterates of OMWU to the last iterate, the last-iterate convergence rate of A2L-OMWU is the same as the average-iterate convergence rate of OMWU. Then it follows from the fact that OMWU’s average-iterate convergence has no game-dependent constant (as shown in our Theorem 3), despite that its last-iterate convergence rate must have some game-dependent constant (as shown by Cai et al. (2024)).

The reduction was shown through OMWU, but can it be applied to standard (non-optimistic) versions to get last-iterate convergence? Is it also applicable to general FTRL?

A: Our A2L reduction is a black-box framework that applies to any online learning algorithm. For example, applying A2L to the standard (non-optimistic) version of MWU—which has $O(\sqrt{T})$ regret—yields an A2L-MWU algorithm with an $O(1/\sqrt{T})$ last-iterate convergence rate. More generally, A2L can be applied to any FTRL-based algorithm, resulting in an $O(1/\sqrt{T})$ rate for non-optimistic versions and an improved $O(1/T)$ rate for optimistic FTRL algorithms.

It would be nice to see some experiments on large-scale games to see the performance of A2L-OMWU, OMWU and OGDA in practice, with respect to number of steps and also the wall-clock time.

A: Thanks for the suggestion. We will include numerical experiment results in the revised version. We would like to remark that OMWU could have an arbitrarily slow last-iterate convergence even in simple games, where our reduction A2L-OMWU gives fast last-iterate convergence. Also note that A2L-OMWU enjoys a logarithmic dependence on the dimension (i.e., the number of actions), whereas OGDA exhibits polynomial dependence. As a result, A2L-OMWU may be particularly preferred in high-dimensional settings.

Could the authors also clarify the dependence on the other constants when compared to Cai et al. (2023)? in Theorem 4?

A: Cai et al. (2023) study only the two-player zero-sum setting ( $n = 2$ ) under bandit feedback and establish a high-probability convergence rate of $O(d^{1/2} \ln^{3/2}(dt/\delta) t^{-1/8}),$ where $d$ is the maximum number of actions, $t$ is the number of iterations, and $\delta$ is the failure probability. In comparison, our Theorem 4 provides a rate of $O(d^{1/5} \ln^{11/5}(dt/\delta) t^{-1/5})$ for the same setting. Our result improves the dependence on all parameters— $d$ , $t$ , and $\delta$ —and thus offers a better high-probability guarantee.

Weakness: They already state that their techniques are limited to games that satisfy the Linear utilities assumption (Assumption 1), such as bimatrix games and polymatrix games.

Response: While our current analysis focuses on games satisfying the linear utilities assumption (e.g., bimatrix and polymatrix games), this assumption is a sufficient condition for applying our reduction—not a necessary one. We recently found that the A2L reduction can apply to other structured settings with possibly nonlinear utilities. We will first recall the core ideas of A2L reduction and then give one example of learning dynamics for market equilibrium.

The core of the A2L reduction lies in the “extraction” step: when players play the average of past iterates and observe corresponding feedback, they recover the feedback they would have received had they all played the original iterate. In our current framework, this is enabled by linearity, but similar structure-based recoverability can exist beyond linear settings.

One example is the proportional response dynamics (PRD) for market equilibrium in Fisher markets [Cheung et al., 2025] (where utilities can be nonlinear). In a fisher market, there are $n$ agents and $m$ goods. Each agent $i$ has a budget $e_i$ and each good $j$ has a unit supply. The proportional response dynamic works as follows:

Budget Spending: each player $i$ decides the allocation of budget $\{b^t_{ij} \}$ where $b^t_{ij}$ is his spending on good $j$ . It must hold that $\sum_j b^t_{ij} = e_i$ ;
Goods allocation: the amount of good $j$ allocated to agent $i$ is proportional to their spending as $x^t_{ij} = b^t_{ij} / \sum_{i} b^t_{i, j}$ . Moreover, the price of each good $j$ is defined as the sum of agents’ spending $p^t_j:= \sum_{i} b^t_{ij}$ .
Update: Based on the allocation $x^t_{ij}$ , each buyer then updates their spending of budget $\{ b^{t+1}_{ij} \}$ in the next round using certain rules.

The original PRD for GS utility has a $O(1/T)$ convergence rate only in terms of the average price, but not the last iterate price [Cheung et al., 2025]. However, by applying A2L reduction— let each agent spend their average budget allocation $\bar{b}^t\_i = \frac{1}{t} \sum_k b^k\_i$ and observe the induced allocation $\bar{x}^t$ ---we can recover the original iterate's allocation due to the proportional structure. Thus, we get a modified PRD with $O(1/T)$ last-iterate convergence in price. Here, the key observation is that the feedback in PRD is the allocation $x$ , which has a nice proportional structure with the bidding $b$ despite each player having nonlinear utilities.

References

Cutkosky, Ashok. "Anytime online-to-batch, optimism and acceleration." International conference on machine learning. PMLR, 2019.

Cheung, Yun Kuen, Richard Cole, and Yixin Tao. "Proportional Response Dynamics in Gross Substitutes Markets." Proceedings of the 26th ACM Conference on Economics and Computation. 2025.

评论- Thank You for the Clarifications

2025-08-02

I thank the authors for the detailed responses and some new connections to PRD etc. All in all , this is quite interesting and the simplicity it seems like displaced a lot of last-iterate convergence results for the past 8 years or so (at least for linear utility games and beyond). While I appreciate the theoretical results, it would may be help to see experimental comparisons, especially on wall-clock times between the reductions and using the original algorithms, to give a better view of the landscape of these algorithms for such games. In general, I would like to maintain my positive score of the paper as, I think it would be good for the community to explore this idea beyond games with linear utilities.

最终决定Accept (poster)

2025-09-17

The paper provides a simple and elegant reduction from average iterate convergence to last iterate convergence in game settings of bilinear structure (this includes 2-player games and polymatrix games). Due to this reduction, the authors achieve faster convergence rates to Nash equilibrium than the state-of-the-art for full information and bandit feedback in 2-player zero-sum games. This is an important contribution as last-iterate convergence has received a lot of attention the last decade from the research community. All reviewers appreciated the contribution and so the AC. We recommend acceptance. We note that initially the paper was recommended for spotlight, but after a discussion between AC and SAC, it was pointed out that the description of the algorithm needs improvement and the decision was bumped down.