Thresholds for sensitive optimality and Blackwell optimality in stochastic games
Abstract
Reviews and Discussion
This paper studies how to find d-sensitive and Blackwell optimal strategies in two-player zero-sum perfect-information stochastic games. It provides new bounds on the d-sensitive threshold $\alpha_d$ and on the Blackwell threshold. These thresholds matter because they determine the complexity of finding such strategies by solving standard discounted games.
The paper's main contributions are the first bounds on the d-sensitive threshold and improved bounds on the Blackwell threshold. The authors use three techniques from algebraic number theory to derive their results: Lagrange bounds, Mahler measures, and root multiplicity bounds. They present results for both deterministic and stochastic games, showing which technique is best for different parameter settings.
Strengths and Weaknesses
Strengths
- The paper provides the first explicit bounds for the d-sensitive threshold (for general values of d), which is a new contribution to the field.
- The use of advanced tools from algebraic number theory (Lagrange bounds, Mahler measures, multiplicity theorems) is innovative and leads to stronger results than prior work.
- Clear improvements: the new bounds on the Blackwell threshold improve over existing results, particularly in their dependence on the number of states.
- The paper is well-organized. It clearly presents the problem, the methods, and the results. The summary tables are helpful for comparing the different bounds.
Weaknesses
- The improved bounds still lead to algorithms with superpolynomial complexity. Therefore, the work does not yet yield efficient, practical algorithms for finding Blackwell optimal strategies.
- The bound in the general stochastic case requires a unichain assumption, which limits the result's applicability.
- The work is entirely theoretical, with no experiments to indicate how tight the bounds are in practice.
Questions
- The multiplicity-based method fails for stochastic games (Remark 4.5). Can you provide more intuition on why the analysis doesn't carry over from the deterministic case?
- The paper offers three different bounds. Is this necessary due to the nature of the problem, or could a unified theory combine these techniques for a single, stronger bound?
- What are the main obstacles to extending the bound to general multi-chain games without the unichain assumption?
- Does your analysis offer any insight into what a "hard" game instance (i.e., one requiring a very high discount factor) might look like?
Limitations
The paper correctly identifies its main limitations.
- The bounds do not yet lead to polynomial-time algorithms.
- The bound for stochastic games is limited to the unichain case.
- The analysis is entirely theoretical, lacking empirical validation.
Final Justification
The rebuttal is satisfactory, and I am going to maintain the positive rating.
Formatting Issues
I did not find any major issues.
We thank you for taking the time to review the paper. We address your questions below.
Response to questions.
Q1: multiplicity-based approach. The multiplicity approach relies on bounding the multiplicity of 1 as a root of the polynomial representing the difference of value functions (since this multiplicity leads to a bound on the minimal value of d for which any d-sensitive optimal strategy is also Blackwell optimal). A known result bounds this multiplicity in terms of the degree of the polynomial and the magnitude of its coefficients (this is in lines 567 - 569 of our paper).
For deterministic games, we can ensure that the coefficient magnitude remains small (using Lemma 3.1 in our submission), allowing us to obtain a meaningful bound with this technique. However, for general stochastic instances, the magnitude of the coefficients can grow exponentially with the number of states (it scales with the common denominator of the transition probabilities). This leads to a vacuous bound that is no better than known results relating d-sensitive optimality to Blackwell optimality. This is why the multiplicity approach breaks down for general games.
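To make the quantity in question concrete, here is a small sketch with a placeholder polynomial (purely illustrative, not one derived from an actual game):

```python
# Toy illustration with a placeholder polynomial (not one arising from a game):
# the quantity discussed above is the multiplicity of 1 as a root, which the
# cited result bounds in terms of the degree and the coefficient magnitude.
import sympy as sp

x = sp.symbols("x")
p_expr = (x - 1)**3 * (2*x**2 + x - 4)      # hypothetical example polynomial
p = sp.Poly(sp.expand(p_expr), x)

# Multiplicity of 1 as a root = number of successive derivatives vanishing at 1.
multiplicity, q = 0, p_expr
while q.subs(x, 1) == 0:
    q = sp.diff(q, x)
    multiplicity += 1

print("degree:", p.degree())                                       # 5
print("max |coefficient|:", max(abs(c) for c in p.all_coeffs()))   # 13
print("multiplicity of 1 as a root:", multiplicity)                # 3
```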
Q2. On our three different approaches. We found that our three proof techniques (Lagrange, Mahler, Multiplicity) each leverage different aspects of the problem, with different strengths and weaknesses:
- The Lagrange bound can use the fact that some coefficients are zero (which is useful for the analysis of d-sensitive thresholds as in Theorem 3.3) and always holds.
- The Mahler bound is stronger than the Lagrange bound in some parameter regimes but not in others.
- The multiplicity analysis provides new results for MDPs and deterministic games (a better bound on the value of d for which d-sensitive optimal strategies are also Blackwell optimal), independently of the problem of bounding the Blackwell threshold.
As evident from the previous items, there are several ingredients entering into the problems that we address in this paper. In particular, our bounds depend on:
- the number of states and the “granularity” of the games (the integer parameters in our paper)
- the deterministic or stochastic character of the Markov chains induced by strategies
- the multiplicity of 1 as a root of some polynomials associated with the games
Additionally, we focus on several different objective functions (Blackwell optimality, and d-sensitive optimality for general d). While a unifying theory is desirable, we found no single method that yields uniformly tight bounds across all these parameters. Instead, our results map out which tool performs best in each regime (as highlighted, for instance, in lines 120 - 126 or in Table 3).
Q3. From unichain to multichain. The unichain assumption allows us to neatly relate conditions on the value-function differences to conditions on a polynomial with a manageable structure (lines 636 - 639 on page 17, in Appendix C). More specifically, under the unichain assumption, we can write the relevant value-function differences as ratios of polynomials in the discount factor with well-controlled degree and coefficients. From this, we can reformulate the definition of d-sensitive optimality (Equation (2) in Definition 2.2 on page 5) as conditions on this polynomial. Controlling this polynomial in all generality (for multichain models) is the main difficulty in removing the unichain assumption. Generalizing to the multichain setting would require a finer characterization of the spectral properties of the Markov chains induced by arbitrary strategies, which is a challenging open problem. We will clarify this further and make it more explicit in our revised manuscript.
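For background, the standard identity underlying such reformulations is the following (a generic statement, not the paper's exact unichain reformulation): fixing a pair of stationary strategies with induced transition matrix $P$ and reward vector $r$ over $n$ states, the discounted value vector satisfies

$$ v_\alpha = (I - \alpha P)^{-1} r, \qquad v_\alpha(s) = \frac{\det\big(A_s(\alpha)\big)}{\det\big(I - \alpha P\big)}, $$

where $A_s(\alpha)$ is $I - \alpha P$ with its $s$-th column replaced by $r$ (Cramer's rule). Each coordinate of $v_\alpha$ is therefore a ratio of polynomials in $\alpha$ of degree at most $n$, so differences of value functions are rational in $\alpha$; the role of the unichain assumption is to keep the resulting numerator polynomial manageable.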
Q4. On hard game instances. This is a fundamental question, which we list as a next step in our open-questions subsection (page 9), where we discuss the importance of finding a lower bound on the Blackwell threshold (and therefore a matching instance). Our analysis suggests that the hardest instances should be stochastic and multichain and should involve intricate transition structures. More fundamentally, our techniques allow us to relate questions about stochastic games/MDPs (how close to 1 can the Blackwell threshold be, for a stochastic game with a given number of states, integer rewards of bounded magnitude, and transition probabilities with a given common denominator?) to questions about polynomials and algebraic number theory (how close to 1 can a real root of a polynomial be, if the polynomial has bounded degree and bounded integer coefficients?). Thus, our approach provides a stronger connection between games and the rich literature on polynomials, which we plan to further exploit to prove lower bounds.
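As a purely illustrative sketch of the polynomial side of this correspondence (a brute-force search over a hypothetical small family of polynomials, unrelated to any specific game construction):

```python
# Toy experiment (for intuition only): among polynomials of a fixed small degree
# with coefficients in {-1, 0, 1}, find the real root in (0, 1) closest to 1.
# This mirrors the algebraic question of how close to 1 a real root can be when
# the degree and the coefficient size are bounded.
import itertools
import numpy as np

degree = 8
best_gap, best_poly = None, None
for coeffs in itertools.product([-1, 0, 1], repeat=degree + 1):
    if coeffs[0] == 0:        # enforce a nonzero leading coefficient
        continue
    if sum(coeffs) == 0:      # skip polynomials having 1 as an exact root
        continue
    roots = np.roots(coeffs)
    real_roots = roots[np.abs(roots.imag) < 1e-10].real
    candidates = real_roots[(real_roots > 0) & (real_roots < 1)]
    if candidates.size:
        gap = 1 - candidates.max()
        if best_gap is None or gap < best_gap:
            best_gap, best_poly = gap, coeffs

print("smallest observed gap 1 - root:", best_gap)
print("achieved by coefficients:", best_poly)
```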
On limitations. We acknowledge the methodological nature of our work. However, we believe our work marks a substantial step forward in the analysis of long-term decision-making: we significantly improve the best-known bound for the Blackwell threshold by reducing the exponent by a full order of magnitude, and we provide the first explicit bounds for the d-sensitive thresholds. To achieve this, we introduce new analytical tools—most notably, the use of Mahler measure bounds, which had not been previously applied in this setting, and a multiplicity-based approach that sharpens our understanding of when d-sensitive strategies become Blackwell optimal. These techniques go beyond prior methods and offer tighter, more general bounds across a variety of regimes.
More broadly, our results contribute to a fundamental and long-standing question in reinforcement learning and game theory: how farsighted must an agent be to act optimally in the long run? This question underpins many real-world applications, from planning under uncertainty to multi-agent systems, and sharp answers have been elusive. Our paper makes concrete progress in this direction. We view this as an essential step toward more principled long-horizon decision-making in learning systems. It is also worth emphasizing that methodological contributions to game theory (broadly defined) have formed a substantial contingent of the papers accepted at ML conferences in recent years, even attracting awards, e.g., the best-paper award at AAAI 2023 for mean-payoff robust MDPs [0], an outstanding-paper award at ICML 2023 [1], and spotlights at NeurIPS 2022 [2,3]. These kinds of methodological advances form the backbone of many prominent successes that have recently made headlines, such as DeepMind solving Go [4], CMU solving Poker [5], and even reinforcement-learning-based fine-tuning of LLMs [6].
Conclusion
We hope that we addressed your main comments. We will refine the paper based on your inputs and our response to your questions. We are looking forward to your new feedback and evaluation/score, and we remain at your disposal through the rebuttal process in case you have further questions.
References
[0] Wang, Y., Velasquez, A., Atia, G., Prater-Bennette, A., & Zou, S. Robust average-reward Markov decision processes. AAAI 2023.
[1] Fiegel, C., Ménard, P., Kozuno, T., Munos, R., Perchet, V., & Valko, M. Adapting to game trees in zero-sum imperfect information games. ICML 2023.
[2] Cai, Y., Oikonomou, A., & Zheng, W. Finite-time last-iterate convergence for learning in multi-player games. NeurIPS 2022.
[3] Anagnostides, I., Farina, G., Kroer, C., Lee, C. W., Luo, H., & Sandholm, T. Uncoupled learning dynamics with swap regret in multiplayer games. NeurIPS 2022.
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature 2016.
[5] Brown, N., & Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 2018.
[6] Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., ... & Piot, B. Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 2023.
I thank the author for the detailed response and clear answers to my questions. I am happy to maintain the positive rating.
The authors study two-player zero-sum perfect-information stochastic games and ask how close the discount factor needs to be to 1 before the optimal discounted strategies are already optimal for more farsighted criteria. In particular, they consider the notion of Blackwell optimality (strategies that remain optimal for every discount factor sufficiently close to 1), and an intermediate notion called d-sensitive optimality that interpolates between mean-payoff and Blackwell optimality (recovered as d approaches infinity). The authors make heavy use of algebraic tools such as Lagrange separation bounds and Mahler-measure bounds to derive new thresholds on the discount factor that yield substantial quantitative improvements over the bounds in prior work.
Strengths and Weaknesses
Strengths.
- The paper uses sophisticated algebraic methods such as Lagrange bounds and Mahler‑measure techniques.
- The explicit bounds improve previous results in their dependence on the size of the game for the Blackwell optimality threshold and, for the first time, bound the threshold for d-sensitive optimality for general values of d. This tightens the best-known estimates for the complexity of computing Blackwell/d-sensitive optimal strategies.
- The paper is technical in nature: even with the connection between the Blackwell threshold and classical polynomial root-separation results, applying these results requires non-trivial analysis to control the coefficients of the polynomials showing up in differences of discounted value functions.
Weaknesses.
- The paper does not construct explicit games that witness a gap between their upper bounds and information-theoretic lower bounds. It would be nice if the authors could provide even a sub-optimal game construction to show how far we are from the optimal bounds for these thresholds.
- For stochastic games, the need for a unichain structure limits applicability; relaxing this assumption is an open direction acknowledged by the authors.
Questions
- Do the authors have a guess as to whether the fact that some bounds are exponentially close to 1 is inherent or a limitation of the analysis?
- Adding game instances where the thresholds are close to 1 would help calibrate how tight the theory is.
Limitations
Yes.
Final Justification
The authors have adequately addressed my questions. My attitude remains positive for the paper.
Formatting Issues
No.
We thank you for your time reviewing our paper and for your positive evaluation.
Response to your questions. Both of your questions pertain to the tightness of the bounds that we obtain. Our conjecture is that the bounds based on Mahler measures are tight, at least for deterministic games. More precisely, we believe that there exist game instances where the Blackwell threshold is exponentially close to 1, with a gap close to the bound of Theorem 1.3, but we do not have a proof of this fact yet.
This conjecture is based on the known tightness of the Mahler bound for polynomials. It is shown in Dubickas [0] that there exist real polynomials that have complex roots exponentially close to 1. However, whether this is true for real roots remains an open question. Indeed, in the work of Dubickas and others, the existence of complex roots close to 1 is obtained by a coarse control of the "Newton polygons", depending on the logarithms of the moduli of the coefficients of candidate polynomials. Showing the existence of real roots close to 1 requires a finer control (of the logarithms of the moduli of the coefficients, and also of their signs), which we do not have yet, because some of the candidate polynomials are obtained in a non-explicit way (as an application of a lemma of Siegel on the existence of small integer solutions of underdetermined linear systems). Whether the bounds remain tight in the case of multichain stochastic games is also an open question. We highlight this as an important future direction of research in our open-questions subsection on page 9.
We do agree that exploring explicit game instances where the thresholds are close to 1 would be valuable and would give more insight into d-sensitive and Blackwell optimality. Constructing such examples (whether worst-case, drawn from real-world scenarios, or sampled at random) is an exciting avenue we plan to investigate further.
[0] Dubickas, A. (1998). On algebraic numbers close to 1. Bulletin of the Australian Mathematical Society, 58(3), 423-434.
Conclusion: We hope that we addressed your main comments, and we thank you again for your positive score. We will refine the paper's presentation based on your input. We remain at your disposal throughout the rebuttal process in case you have further questions.
Thank you for the response. My attitude remains positive for the paper.
This paper studies refinements of the mean-payoff criterion in two-player zero-sum perfect-information stochastic games, focusing on Blackwell optimality and its generalization through d-sensitive optimality. The authors introduce new bounds on the d-sensitive threshold \alpha_d, a notion that interpolates between mean-payoff and Blackwell optimality. The paper employs methods and toolkits from separation bounds on algebraic numbers and shows that these results can improve bounds for the Blackwell threshold. These thresholds are critical for understanding the algorithmic complexity of computing optimal strategies, as solving discounted games typically requires O(1/(1−α)) iterations.
Strengths and Weaknesses
Strengths
- The paper did a reasonably good job introducing to me the concepts of sensitive optimality and Blackwell optimality. The techniques of separation bounds, algebraic roots, Mahler measures, and multiplicity theorems from algebraic number theory are completely new to me, and I am surprised at how they can be used in a learning problem. There seem to be many new ideas in the analysis of this problem (especially over another NeurIPS paper [GCP23]), though the technical analysis is well beyond my comfort zone, so I cannot make a good judgement call.
- The problem is well-motivated as a generalization of the fundamental concept of Blackwell optimality to sensitive optimality. The cited statement of “one of the pressing questions in reinforcement learning” is a good showcase.
- The paper clearly acknowledged the limitations of their results.
Weaknesses
- I am not sure if the problems studied by this paper fit the themes of NeurIPS, or whether its results might be interesting to a general audience at this conference.
- A notation table can help us quickly parse through your technical results.
- Typos: the left/right square brackets throughout this paper are sometimes not matched
Questions
- The results seem more technically heavy than a usual NeurIPS paper; would the paper be more suitable for a more theoretical audience to review, such as at COLT or ALT?
- Perhaps I am not the right reviewer for this paper, but I wonder what could be the insights and takeaways for practitioners in their design and analysis of real-world RL algorithms?
Limitations
Yes.
Final Justification
Overall, I recommend acceptance of this paper given its solid technical depth. That said, I still have concerns about whether this topic is of interest to a general audience at NeurIPS, but I see this as a rather minor concern.
Formatting Issues
n/a
We would like to thank you for taking the time to review our paper and for your constructive comments. We will revise our manuscript to fix any typos/writing issues (e.g. the brackets issues that you mentioned) and use the extra space to add a table with notations/assumptions.
Response to your questions:
1. Fit for NeurIPS.
We believe that the topic of our work makes our paper a good fit for NeurIPS; more generally, the problem of designing methods for long-term decision-making with strategic agents has broad relevance to the machine learning community.
Indeed, repeated games, MDPs, and their generalizations (like stochastic games) form the backbone of many AI systems, including prominent successes that have recently made headlines, such as DeepMind solving Go [4], CMU solving Poker [5], and even reinforcement-learning-based fine-tuning of LLMs [6]. The specific concepts we study (Blackwell optimality and d-sensitive optimality) are directly connected with how long-term reward is modeled and optimized, a question at the heart of reinforcement learning.
From an academic standpoint, this is exemplified by the number of recent papers at NeurIPS/ICML/AAAI on the following topics: Markov games, robust MDPs, repeated games in extensive form, and even simpler setups such as two-player zero-sum games have all attracted a lot of attention in recent years. In fact, papers focusing on these topics have even received prizes at these conferences:
- a recent paper on average-reward robust MDPs (a setup equivalent to stochastic games) won a best paper award at AAAI 2023 [0];
- two papers on multiplayer games were oral presentations at NeurIPS 2022 [1,2];
- a paper on adaptively solving repeated games won an outstanding paper award at ICML 2023 [3].
We will take care to more explicitly connect our contributions with these lines of work in the revised version.
2. On the practical takeaways.
We will do our best to revise our paper to better highlight the main takeaways for practitioners. Our contributions in this sense are twofold:
- Better modeling of long-term reward: the different objectives considered in this paper (Blackwell optimality and d-sensitive optimality) are refinements of the usual discounted return. Our objectives allow decision-makers to better take into account future rewards and to choose strategies that are very far-sighted (taking into account the rewards obtained after a very large number of steps).
- Quantitative takeaway: at a high level, our main takeaway is that optimizing these long-run objectives is quite difficult but can be safely reduced to discounted problems with discount factors close to 1; our bounds on the Blackwell threshold and the d-sensitive threshold precisely quantify how close to 1 these discount factors should be (a minimal sketch illustrating this reduction follows this list).
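```python
# Minimal sketch of the reduction described in the bullet above: solve a
# discounted problem with a discount factor close to 1 and read off the greedy
# policy. The 3-state, 2-action MDP and the choice of alpha are placeholders;
# in practice alpha would come from a threshold bound such as those in the paper,
# and two-player games would use a max-min analogue of this iteration.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
r = rng.integers(-2, 3, size=(n_states, n_actions)).astype(float)  # r[s, a]

alpha = 1 - 1e-3  # discount factor close to 1 (placeholder, not a proven threshold)
v = np.zeros(n_states)
for _ in range(500_000):        # value iteration on the discounted problem
    q = r + alpha * P @ v       # q[s, a] = r(s, a) + alpha * E[v(s') | s, a]
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-9:
        v = v_new
        break
    v = v_new

print("greedy (candidate far-sighted) policy:", q.argmax(axis=1))
```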
Let us finally mention some practical domains where long-term reward matters: This is important in applications with a very long-run horizon (e.g. games played over many rounds, such as Poker [5], Atari Games [7], Diplomacy [8], etc.) or applications requiring sustainable growth objectives, e.g. applications of sequential decision-making in sustainable exploitation and harvesting [9,10]. We will use the additional space to highlight these connections.
Conclusion:
We hope that we addressed your main comments. We will improve the paper based on your inputs. We are looking forward to your new feedback and evaluation/score, and we remain at your disposal through the rebuttal process in case you have any other questions.
References
[0] Wang, Y., Velasquez, A., Atia, G., Prater-Bennette, A., & Zou, S. Robust average-reward Markov decision processes. AAAI 2023.
[1] Cai, Y., Oikonomou, A., & Zheng, W. Finite-time last-iterate convergence for learning in multi-player games. NeurIPS 2022.
[2] Anagnostides, I., Farina, G., Kroer, C., Lee, C. W., Luo, H., & Sandholm, T. Uncoupled learning dynamics with swap regret in multiplayer games. NeurIPS 2022.
[3] Fiegel, C., Ménard, P., Kozuno, T., Munos, R., Perchet, V., & Valko, M. Adapting to game trees in zero-sum imperfect information games. ICML 2023.
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature 2016.
[5] Brown, N., & Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 2018.
[6] Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., ... & Piot, B. Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 2023.
[7] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. Human-level control through deep reinforcement learning. Nature 2015.
[8] Meta Fundamental AI Research Diplomacy Team (FAIR), Bakhtin, A., Brown, N., Dinan, E., Farina, G., Flaherty, C., ... & Zijlstra, M. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science 2022.
[9] Possingham, H., & Tuck, G. Application of stochastic dynamic programming to optimal fire management of a spatially structured threatened species. In Proceedings International Congress on Modelling and Simulation, MODSIM, pages 813–817, 1997.
[10] Ekeland, I., Karp, L., & Sumaila, R. Equilibrium resource management with altruistic overlapping generations. Journal of Environmental Economics and Management, 70:1–16, 2015.
I thank the author for the detailed response. My concern remains but I understand these are rather minor ones. So I am happy to maintain the positive rating.
The paper obtains bounds on the d-sensitive and Blackwell thresholds for stochastic games. The bounds improve upon existing results. Deterministic and general stochastic games are considered.
Strengths and Weaknesses
Strengths:
- A thorough analysis of the different regimes of the stochastic game parameters is provided.
- A variety of techniques are used to show the bounds.
Weaknesses:
- The comparison tables could be improved: the results for MDPs should be presented separately from those for stochastic games.
- Since the MDPs considered may be such that the reward function depends on both the current and next states, the definition of the MDPs considered could be detailed, along with its effect on the results.
Questions
- The point regarding the definition of MDPs mentioned in the weaknesses section could be clarified.
- Is there an assumption of smoothness on the reward function in either the MDPs or the stochastic games? Or is it not needed due to the finite state space?
Limitations
Yes.
Formatting Issues
The paper is formatted well.
We appreciate your time in reviewing our paper and we thank you for your positive score. We answer your questions below.
Question 1: clarification
- Thanks for pointing out these presentation issues. In our revised manuscript, we will use the extra space to enhance the exposition of past results and more clearly separate and highlight the differences between the results obtained for MDPs (one player) and stochastic games (two players).
- We have introduced stochastic games where the rewards depend on the current state and the players' actions, but not on the next state. This assumption is standard and without loss of generality: any game where the rewards also depend on the next state can be converted to a game with rewards depending only on the current state and actions, by expanding the state space to include intermediary states of the form (i,j,a,b) (a minimal sketch of a related reduction is given after the references below). It is worth emphasizing that many standard textbooks and papers on stochastic games focus on the setting where rewards do not depend on the next state, e.g. the competitive MDP textbook [0] (e.g. Chapter 3 there on discounted stochastic games) and the main collection of articles on stochastic games edited by Neyman and Sorin [1] (e.g. Chapter 1 by Lloyd Shapley in this book).
[0] Filar, J., & Vrieze, K. (2012). Competitive Markov decision processes.
[1] Neyman, A., & Sorin, S. (Eds.). (2003). Stochastic games and applications.
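To complement the intermediary-state construction mentioned above, here is a minimal sketch of a simpler, expectation-based reduction for the single-player (MDP) case with expected-return criteria; this is not the construction used in the paper, and the array layout is hypothetical:

```python
# Expectation-based reduction for the single-player (MDP) case, distinct from
# the intermediary-state construction described above: replacing r(i, a, j) by
# sum_j P(j | i, a) * r(i, a, j) preserves all expected discounted values for
# every discount factor, and hence which policies are d-sensitive or Blackwell
# optimal. Array shapes below are hypothetical.
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))                  # P[i, a, j]
r_next = rng.integers(-3, 4, size=(n_states, n_actions, n_states)).astype(float)  # r[i, a, j]

# Reduced reward depending only on the current state and action.
r_reduced = np.einsum("iaj,iaj->ia", P, r_next)

print(r_reduced.shape)  # (4, 3): one reward per state-action pair
```

One trade-off is that this expectation step generally turns integer rewards into rationals, whereas the intermediary-state construction keeps the rewards integral at the cost of a larger state space; the same expectation argument extends to two players by replacing the action index with a pair of actions.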
Question 2: smoothness
You are correct in noting that we do not require any smoothness assumptions in our paper. Since the sets of states and actions are finite, all functions (rewards and transitions) are defined over finite domains. Therefore, continuity or differentiability assumptions are not needed. It is worth noting that the Blackwell threshold may not exist in models with continuous state sets, and this is even the case for countable state sets, as shown in Maitra (1965), "Dynamic programming for countable state systems". We will make this point more explicit in the paper to avoid any confusion.
Conclusion:
We hope that we addressed your main comments, and we thank you again for your positive score. We will refine the paper's presentation based on your input. We remain at your disposal throughout the rebuttal process in case you have further questions.
Thank you for the response. I maintain support for the paper.
This paper considers two-player perfect-information zero-sum games, for which it provides improved bounds for the Blackwell threshold, while for the d-sensitive threshold it is the first work to give results beyond the previously studied case. The results are based in part on a variety of non-trivial algebraic techniques, and though a potential concern was raised by Reviewer jDrr about the fit for NeurIPS, I believe that these techniques (along with the results themselves) could certainly be of general interest to the reinforcement learning and algorithmic game theory communities, whose presence at NeurIPS is well established. As the reviewers are additionally in support of the paper, I recommend acceptance to the conference.