Dynamic Discounted Counterfactual Regret Minimization
Abstract
Reviews and Discussion
This paper studies the CFR framework in the setting of incomplete information games, and proposes a modification that allows for equilibrium finding using a dynamic, automatically-learned scheme. To do so, the authors formulate CFR’s iteration process as an MDP and give a transformation that frames the (CFR) discount scheme as a policy optimization problem wherein the dynamic discounting scheme is an agent that interacts with the MDP. The primary motivation of this framework is to be able to learn discounting policies that can generalize over different IIGs, rather than having to be carefully designed for specific games. In this direction, the authors design a training algorithm based on evolution strategies that can generalize well, even to games that are not seen in the training process.
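To make the MDP framing concrete, here is a minimal sketch of how such an iteration loop might look; the helpers (`init_cfr_state`, `exploitability`, `normalize`, `cfr_iteration`, `policy.act`) are hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Illustrative sketch only: CFR's iteration process viewed as an MDP in which a
# learned policy picks DCFR's discount parameters on the fly. All helper
# functions below are hypothetical stand-ins, not the paper's actual code.

def run_ddcfr(game, policy, total_iters=1000):
    state = init_cfr_state(game)          # cumulative regrets and average strategy
    expl = exploitability(game, state.avg_strategy)
    t = 1
    while t <= total_iters:
        # MDP observation: normalized iteration count and normalized exploitability
        obs = (t / total_iters, normalize(expl))
        # MDP action: DCFR discount parameters plus a time window to keep them
        alpha, beta, gamma, num_steps = policy.act(obs)
        for _ in range(num_steps):
            state = cfr_iteration(game, state, alpha, beta, gamma)
            t += 1
            if t > total_iters:
                break
        expl = exploitability(game, state.avg_strategy)
    return state.avg_strategy
```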
Strengths
The core idea of the paper, namely the dynamic policy optimization procedure for learning discount schemes, is conceptually important since it allows generalization to unseen games, circumventing the need for suboptimal fixed discounting schemes. Moreover, the experiments presented are convincing, showing not just benchmarking against the current state of the art but also ablation studies, which give a stronger idea of which aspects of DDCFR make the most improvement over standard CFR approaches. The DDCFR methodology is also shown to be effective in modifying other CFR variants, which is a useful property to have. Moreover, the paper is well written in that the motivations are clear, and any potential concerns are discussed in fair detail (for instance, the discussion about the potential additional computational costs).
Weaknesses
The framework introduced seems to be useful in practice and effectively stitches together several well-established ideas. A minor weakness would be that there isn’t much technical novelty, since all the results and techniques used are taken from prior works. However, this doesn’t diminish the quality of the results greatly to me. Another minor weakness is that currently the explanations in the ablation studies section feel quite vague and handwavy. My suggestion to the authors would be to highlight several key results in the table and provide a more substantiated explanation for each. It would also have been nice to see running time comparisons in the experiments to properly verify the authors’ claims about additional computational costs not causing major increases in running time.
Questions
- How does the DDCFR technique compare to deep methods like Deep CFR when it comes to larger scale games?
- How does DDCFR empirically scale to games with more players? In principle, it seems to me that it would provide similar benefits compared to current CFR methods in multiagent settings but I’m curious if this is something the authors have explored.
We sincerely appreciate the reviewer for the positive recommendation of our work and the insightful feedback. Following are our responses to your concerns.
- How does the DDCFR technique compare to deep methods like Deep CFR when it comes to larger scale games?
Many thanks for your constructive comments. DeepCFR finds an approximate Nash equilibrium by combining a base CFR-type algorithm with deep neural network function approximation. For instance, the original DeepCFR paper [1] employs LinearCFR as the base CFR algorithm, which is a specific version of DCFR.
Based on these considerations, there are inherent differences in the underlying principles and objectives of DDCFR and DeepCFR. Specifically, DDCFR focuses on enhancing the performance of base CFR-type algorithms through dynamic discounting. In contrast, DeepCFR aims to use neural networks to approximate and generalize the behavior of base CFR-type algorithms. Consequently, it is not appropriate to directly compare these two approaches from our perspective.
We want to thank you again for your insightful comment, which has inspired us greatly. In particular, since our DDCFR is a strong CFR-type algorithm, it can also serve as a strong base algorithm for DeepCFR. DeepCFR, in turn, can utilize neural networks to approximate and generalize the behavior of DDCFR, thereby potentially improving its performance compared to using the original LinearCFR base algorithm. Since deep methods are not the focus of our paper, we consider this promising idea as an intriguing avenue for future research.
[1] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In International Conference on Machine Learning, pp. 793–802, 2019.
- How does DDCFR empirically scale to games with more players?
Thank you very much for your insightful question. Our work focuses on solving two-player zero-sum games, and we have not delved into the application of DDCFR in multiplayer games for the following reasons:
For two-player zero-sum games, CFR effectively computes strategies that converge to an approximate Nash equilibrium. A Nash equilibrium is useful in two-player zero-sum games since it ensures that the player maximizes its utility against a worst-case opponent, effectively avoiding losses. However, when applied to multiplayer games, CFR loses all theoretical guarantees. It is not guaranteed to converge to a Nash equilibrium, and playing a Nash equilibrium may not be wise in multiplayer games [1].
Nevertheless, CFR-type algorithms have the potential to generate strategies that empirically perform well in multiplayer games [2,3]. We believe that our DDCFR is a general framework that can enhance CFR's empirical performance in multiplayer games as well. We will explore this exciting direction in our future work.
[1] Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456): 885–890, 2019.
[2] Nicholas Abou Risk and Duane Szafron. Using counterfactual regret minimization to create competitive multiplayer poker agents. In International Conference on Autonomous Agents and Multiagent Systems, pp. 159–166, 2010.
[3] Richard Gibson. Regret minimization in games and the development of champion multiplayer computer poker-playing agents. PhD thesis, University of Alberta, 2014.
- It would also have been nice to see running time comparisons in the experiments to properly verify the authors' claims about additional computational costs not causing major increases in running time.
Thank you very much for your constructive suggestion. We have added a table in Appendix D, which compares the running time between DCFR and DDCFR. The results show that the additional computation time incurred by DDCFR is negligible compared to the overall iteration time.
Thank you for the response and answers to my questions, they were very helpful. My recommendation for acceptance remains unchanged.
Dear reviewer, thank you very much for taking the time to read our response. Once again, we sincerely appreciate your thorough review and insightful feedback.
The paper offers an interesting extension to counterfactual regret minimization inspired by reinforcement learning for learning the Nash equilibrium in imperfect information games. Specifically, the paper proposes an MDP formulation of adjusting the discount rate in DCFR algorithms, a provable guarantee for the proposed RL approach, empirical tips and tricks for speeding up the convergence of the RL algorithm, and simulation studies on incomplete information games.
Strengths
- The problem studied by the paper is an interesting one and the approach taken is intriguing.
- The paper provides the learned models in the supplemental materials, which make the results easier to verify.
- The paper offers a comprehensive overview of the problem setting and discusses well its contributions in context of existing works.
- The set of games considered in the experiments is comprehensive.
Weaknesses
- The paper could be better organized. Admittedly the problem studied here is a challenging one and necessitates heavy notation, as well as a collection of specific terminologies from both AGT and RL. However, some concepts are not used (or at least not used explicitly), causing the paper to appear overwhelmingly dense on a first read. Assuming that the paper is targeted at an audience with a game theory background, perhaps it would make more sense to spend some time describing and motivating the key algorithmic contributions from the RL angle. For instance, it might be useful to defer the mathematically rigorous definitions of information set reach probability or history reach probability to the appendix, and keep in the main body only an intuitive description of these concepts, as the paper's key contribution lies in the MDP-based approach proposed later. This takes up space and takes attention away from more important concepts such as exploitability and "realized" exploitability.
- On a related note, there are some overloaded notations that make the paper harder to parse. For instance, the same letter refers both to the actions taken by the agents in the "Game Theory" part of the paper and to the action taken by the learner in the "Reinforcement Learning" part of the paper.
- For the experiments, it would be nice if the authors could provide the wall-clock times for DDCFR. DCFR has the added benefit that the algorithm requires little additional computation at each round (the discount rates are specified beforehand), and it would be a fairer comparison to consider the computational aspect as well. (I am convinced by the arguments in Section 3.4, but adding these results would better validate the claims in Section 3.4 empirically.)
- Theorem 1 requires the action space to be discretized, yet this is not stated explicitly (also see the question below regarding boundedness of the discount rates).
- It appears that experiments are only run once, yet multiple runs are required to demonstrate that the proposed approach convincingly outperforms DCFR and that the figures in the paper are not due to luck alone.
Questions
- Shouldn't there be (at the very least) some mild assumptions on the boundedness of the discount rates? In the degenerate case, suppose the initial policy always selects the maximum time window and some really poor choice of discount rates. Will the algorithm still converge at the rate provided in Theorem 1?
- Purely out of curiosity, why should we directly "jump" to RL, which, as the paper points out, is "notably sensitive to hyperparameters"? Does it make sense to consider a bandit-based approach for optimizing the discount rates?
We sincerely appreciate the insightful feedback and your efforts in reviewing our paper. Following are our responses to your concerns.
- The paper could be better organized.
Thank you very much for your valuable suggestion. Section 2 of the current paper contains a significant number of formulas, potentially imposing a heavier reading burden on the audience. Following your advice, we have carefully restructured Section 2. In the revised version, we provide only intuitive descriptions of key concepts, such as counterfactual values and reach probabilities, in Section 2, and we have moved their mathematically rigorous definitions, together with related notions such as the history relationship and the player function, to Appendix A. Furthermore, we have simplified the definition of the range of utilities. We believe that this updated version will enhance the paper's accessibility for researchers in both the AGT and RL communities.
- There are some overloaded notations that make the paper harder to parse.
Thank you very much for pointing out the issue with overloaded notation making certain parts of the paper ambiguous. Using unique symbols eliminates confusion and improves readability.
Following your suggestion, we have made the following modifications.
- We now use a dedicated symbol for the action of the RL agent to distinguish it from actions in the game tree.
- We now use a dedicated symbol for the reward function in the MDP to distinguish it from the cumulative regret in CFR.
- We now use a dedicated symbol for the noise standard deviation in ES to distinguish it from strategies in the game.
- We now use a dedicated symbol for the population size in ES to distinguish it from the state transition function in the MDP.
- We uniformly use t to denote a single iteration and T to denote the total number of iterations.
- For the experiments, it would be nice if the authors could provide the wall-clock times for DDCFR.
Thank you very much for your constructive suggestion. To further support the claim in Section 3.4, we have added a table in Appendix D comparing the wall-clock time of DCFR and DDCFR. The results show that the additional computation time incurred by DDCFR is negligible compared to the overall iteration time.
- Theorem 1 requires the action space to be discretized, yet this is not stated explicitly (also see the question regarding boundedness of the discount rates).
Thank you very much for your comments. It seems that there may be a slight misunderstanding regarding Theorem 1. Please allow us to provide a more detailed explanation.
First of all, Theorem 1 supports a continuous action space, as long as the actions lie within a certain range. For example, in every iteration, the discount-rate action may take any real value within its prescribed interval (e.g., between -5 and 0). Moreover, our DDCFR also adopts a continuous action space design. The ablation studies in Section 4.3 demonstrate that the continuous action space design outperforms the discrete action space design.
On the other hand, the algorithm still converges in the degenerate case you describe. If the policy never adjusts the hyperparameters and always reuses the initial discount rates, then as long as those initial values lie within the range specified by Theorem 1, the conditions of Theorem 1 are still satisfied. Therefore, the algorithm still converges at the rate provided in Theorem 1.
In fact, we can view DCFR as a special case in this setting, where the discount rates are held fixed at DCFR's default values (α = 1.5, β = 0, γ = 2) in every iteration. This specific configuration meets the requirements of Theorem 1.
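For reference, a brief recap of DCFR's fixed discounting scheme (Brown & Sandholm, 2019), writing $w^{+}_t$, $w^{-}_t$, and $w^{\sigma}_t$ as shorthand for the multipliers applied on iteration $t$ to the accumulated positive regrets, the accumulated negative regrets, and the accumulated average-strategy contributions, respectively:

$$
w^{+}_t = \frac{t^{\alpha}}{t^{\alpha}+1}, \qquad
w^{-}_t = \frac{t^{\beta}}{t^{\beta}+1}, \qquad
w^{\sigma}_t = \left(\frac{t}{t+1}\right)^{\gamma},
\qquad \text{with defaults } \alpha = 1.5,\; \beta = 0,\; \gamma = 2.
$$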
In summary, Theorem 1 indicates that numerous discounting schemes converge in theory as long as the discounting weights in each iteration are within a certain range. The primary contribution of our DDCFR is the discovery of a dynamic and automatically learned scheme.
- It appears that experiments are only run once, yet multiple runs are required to demonstrate that the proposed approach convincingly outperforms DCFR and that the figures in the paper are not due to luck alone.
Thank you very much for your comments. We want to emphasize that there is NO randomness throughout the iteration process of DDCFR when using the learned discounting policy on the testing games. Specifically, DDCFR always initializes with the uniform random average strategy, and the policy network produces the same action whenever its inputs are the same. Put simply, each run of DDCFR with the learned model generates exactly the same exploitability curve.
We use ES to learn the discounting policy, and we select one of the best-performing policies on the training games for testing. In response to your concerns, we independently ran the ES training process 5 times and tested the performance of the resulting 5 discounting policies. In Appendix F of the revised paper, we have added figures showing the average performance of DDCFR. The results show that DDCFR remains superior to the other baselines, demonstrating that its advantage is not due to luck.
- Does it make sense to consider a bandit-based approach for optimizing the discount rates?
As far as we know, in a typical bandit scenario there is usually a single state with multiple available actions. However, in our setting, the agent takes actions across multiple states over the course of the iterations, and the optimal action varies from one state to another. Therefore, based on our understanding, a bandit-based approach is less suitable for our scenario.
Thank the authors for their detailed feedback and timely updates to the manuscript. I have increased my score.
I think it might be useful to consider clarifying that |A| does not mean the cardinality of a finite set, which seems to be the more commonly used definition in RL. Same for emphasizing that runs in the experiment section are deterministic as opposed to stochastic, which again is more common in RL. These are really, really minor nitpicks.
Dear reviewer, thank you very much for reviewing our response and updating the score. We genuinely appreciate your meticulous reading and insightful comments.
- I think it might be useful to consider clarifying that |A| does not mean the cardinality of a finite set, which seems to be the more commonly used definition in RL.
Thank you very much for your good suggestion. We have incorporated the clarification in Section 2.1 of the revised paper: with a slight abuse of the cardinality notation, we use |A| to denote the maximum number of actions within each information set, i.e., |A| = max_I |A(I)|.
- Same for emphasizing that runs in the experiment section are deterministic as opposed to stochastic, which again is more common in RL.
Thank you very much for this suggestion. We have also clarified this in section 4.2 of the revised paper: It is worth noting that, unlike agent performance testing in RL, which usually requires multiple runs due to the randomness in the environment, the iteration process of DDCFR is deterministic (Appendix F). Therefore, we only need to test it once in each game.
Thank you once again for your invaluable contributions to our work.
The paper introduces a Reinforcement Learning framework to manage the decaying parameters of Discounted CFR dynamically. To do so, the paper proposes a game-agnostic neural policy that receives as input the iteration and current exploitability and outputs DCFR's parameters and a time window for which those parameters will be used. The proposed algorithm is proven to maintain the same regret guarantees (up to constants) as CFR, and experiments validate the approach of training a general policy on simple games and then evaluating it on a suite including much larger games. Ablations are provided, showing the value of each piece of the algorithm presented.
Strengths
- Simple idea. The whole paper is built on an initial observation and a single concept. This makes the paper follow a linear and clear structure.
- Clear presentation. The paper is well written and answers almost all the questions one may have on the topic.
- Clear and sound experimentation framework. The experiments answer all the questions.
- Good empirical results. The results are encouraging and worthy of publication.
Weaknesses
- The tradeoff between computational power and significance of the results is arguable. While I agree about the amortized cost of training on small games and using the same policy on larger instances, the computation of accurate exploitability is a hard constraint of the technique as presented in the paper. While I think this is the largest weakness regarding the scalability of the technique, I do not think that such a constraint needs to be circumvented in this paper.
- The comparison with RL algorithms could be explored a bit more. I'm convinced that PPO is a bad choice in the given setting due to long horizons and sparse rewards, but I'm not convinced about the multi-task nature of the RL challenge at hand. In my view, the task is identical and there is just stochasticity induced by the distribution over games during training; I'd expect a multi-task problem to explicitly pass the task to the policy as an observation in order to condition the policy. My suggestion is to expand a bit more on this topic in the paper (e.g., why wouldn't setting gamma=1 solve the long-horizon problem?).
Minor suggestions:
- Is Equation 4 wrongly formatted? Especially the parentheses and the symbols.
- add the default DCFR parameters as a constant line in the plots in Figure 4
Questions
- In Section 3.4, you say that the exploitability computation can be done in parallel. I'm not convinced about this, since to run a CFR computation you need the parameters, which depend on the exploitability of the average strategy after the latest iteration. Am I missing something? The only parallelizable step is the computation of the gradient in the ES subroutine.
- what happens in the first 500 iterations of DDCFR? Why are those hidden?
- The comparison with RL algorithms could be explored a bit more.
Thank you very much for the insightful feedback. When formulating CFR's iteration process as an MDP, applying RL to optimize the discounting policy seems to be the most natural and straightforward approach. However, we found that achieving meaningful results using RL (e.g., PPO) is challenging and exhibits significant instability.
In fact, multi-task learning is an important issue in our setting. The training games are diverse in size and have different convergence speeds, so balancing the learning across different tasks in multi-task RL is difficult and requires extensive hyperparameter tuning. Moreover, explicitly conditioning the policy on game IDs may not generalize well to unseen testing games.
Given these challenges, our current approach opts for the simple, scalable, and robust ES to optimize the policy. Exploring more advanced multi-task RL algorithms and automatic hyperparameter tuning techniques to optimize the policy more efficiently is a promising direction for future work.
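For illustration, here is a minimal sketch of the kind of basic ES update we rely on (in the spirit of Salimans et al., 2017); `evaluate_discount_policy` is a hypothetical fitness function, e.g., the negative final exploitability averaged over the training games, and the hyperparameter defaults are purely illustrative, not our exact training code.

```python
import numpy as np

def es_update(theta, evaluate_discount_policy, lr=0.01, sigma=0.05, pop_size=100, rng=None):
    """One evolution-strategies step on the discounting-policy parameters theta.

    `evaluate_discount_policy(params)` is a hypothetical fitness function, e.g.
    the negative final exploitability averaged over the training games.
    """
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal((pop_size, theta.size))            # Gaussian perturbations
    fitness = np.array([evaluate_discount_policy(theta + sigma * eps) for eps in noise])
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)  # normalize fitness scores
    grad_est = noise.T @ fitness / (pop_size * sigma)              # NES gradient estimate
    return theta + lr * grad_est
```

Since each fitness evaluation is independent, the population can be evaluated in parallel across many CPU cores, which is what makes ES straightforward to scale.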
- Is Equation 4 wrongly formatted?
Thank you very much for your thorough review. In the revised paper, we have made corrections to Equation 4, ensuring that the parentheses and symbols are correctly formatted. We have also diligently reviewed and adjusted the remaining formulas to ensure consistency.
- Add the default DCFR parameters as a constant line in the plots in Figure 4.
Thank you for your suggestion to enhance Figure 4. We have updated the figure in the revised paper to include a constant line representing the default DCFR parameters, which helps readers more easily interpret the results and compare DDCFR's dynamic discounting scheme with DCFR's fixed scheme.
- What happens in the first 500 iterations of DDCFR? Why are those hidden?
Due to the rapid decline in exploitability in the early iterations of different CFR algorithms, their differences are primarily manifested in the later iterations. Consequently, for a better comparison of the algorithms, we focus on the later iterations.
For the sake of completeness, in Appendix E of the revised paper, we have added figures to include all iterations.
Thanks for your answers, they satisfy my questions.
Dear reviewer, we sincerely appreciate your time and effort in reviewing our response. We are pleased that our reply has addressed your concerns. Thank you once again for your detailed review and constructive feedback to our work.
Thank you very much for your positive recommendation of our work and your insightful comments. Following are our responses to your concerns.
- Why the exploitability computation can be done in parallel.
We sincerely apologize for any confusion arising from our inaccurate description that "the exploitability calculation and the CFR iteration process are independent". What we actually mean is that the exploitability calculation of the average strategy from the previous iteration and the instantaneous regret calculation at the current iteration are independent. Allow us to provide a more detailed clarification.
On each iteration, our DDCFR first recursively traverses the game tree using the current strategy to compute the instantaneous regrets. Then, DDCFR updates the cumulative regrets, the average strategy, and the new strategy using the weights generated by the discounting policy. The inputs to the discounting policy are the normalized iteration number and the normalized exploitability of the average strategy from the previous iteration.
Calculating this exploitability requires traversing the game tree using the previous average strategy, which leads to additional time overhead under a naive implementation. However, it is crucial to note that the exploitability calculation and the instantaneous regret computation can be carried out in parallel, since these two processes depend on two different strategies (the previous average strategy and the current strategy) and each requires traversing the game tree once. Consequently, there is no additional wall-clock time overhead associated with the calculation of exploitability in our parallel implementation.
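As a rough sketch of this parallel implementation (the traversal helpers `compute_instantaneous_regrets` and `compute_exploitability` are hypothetical stand-ins, not our actual code), the two game-tree traversals can simply be launched as concurrent tasks:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_traversals(game, current_strategy, prev_avg_strategy,
                        compute_instantaneous_regrets, compute_exploitability):
    """Run the regret traversal and the exploitability traversal concurrently.

    Both helpers are hypothetical: each performs one game-tree traversal using
    its own strategy, so neither needs to wait for the other. For CPU-bound
    pure-Python traversals, a ProcessPoolExecutor (or traversals implemented in
    native code) would be needed to obtain a true parallel speedup.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        regret_future = pool.submit(compute_instantaneous_regrets, game, current_strategy)
        expl_future = pool.submit(compute_exploitability, game, prev_avg_strategy)
        return regret_future.result(), expl_future.result()
```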
The only remaining time overhead of our DDCFR is the policy inference time, i.e., one forward pass through the discounting policy network. However, this cost is negligible compared to the overall iteration time. Please refer to Appendix D for more details.
We genuinely appreciate your insightful comment, which has prompted us to refine the description in Section 3.4. Additionally, we have corrected the notation for the normalized exploitability in Section 3.1 of the revised paper.
- The scalability of the proposed technique.
Our DDCFR algorithm is founded on the state-of-the-art DCFR algorithm. Consequently, the scope of application for our DDCFR is identical to that of DCFR. When scaling up to larger games, we can leverage numerous well-established techniques, such as decomposition [1] and subgame-solving [2,3], to mitigate problem complexity.
Taking inspiration from your valuable feedback, we plan to explore how to integrate the concept of dynamic discounting into decomposition and subgame-solving techniques to address larger-scale imperfect information games in the future.
[1] Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In AAAI Conference on Artificial Intelligence, pp. 602–608, 2014.
[2] Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems, pp. 689–699, 2017.
[3] Noam Brown, Tuomas Sandholm, and Brandon Amos. Depth-limited solving for imperfect-information games. In Advances in Neural Information Processing Systems, pp. 7674–7685, 2018.
The paper introduces a dynamic weighting scheme for iterates of CFR, employing an MDP to determine optimal weights based on the algorithm's state. The empirical results support the effectiveness of this approach. Their scheme/framework is modular and can be applied to state-of-the-art CFR variants including DCFR and PCFR.
Strengths
I reviewed a previous version of this paper.
The authors have adequately addressed the concerns brought up by the reviewers in this version of the paper. This includes applying their framework to PCFR and comparing DDCFR directly to PCFR, as well as comparing to Greedy Weights. Additionally, they provide a convergence analysis of their algorithm.
Weaknesses
The algorithms' numerical performances are depicted starting at 500 iterations. It would be nice to show the performance of the algorithms in early iterations.
While the paper notes that the time spent training the network used for the dynamic weight calculation is amortized over the number of times the learned policy is applied to equilibrium computation in different games, it never explicitly states how much time was spent training the network.
Questions
I would suggest using a log-scale on the x-axis and start the x-axis at the first iteration. This allows easy comparison of the algorithms being compared and makes clear both short-term performance and long-term performance. It is confusing that the exploitability starts at small values at the beginning of the graph and then doesn't seem to change much in Kuhn, for example.
Could you note the actual amount of time required to train the policy in Section 3.4? Additionally, I would suggest ordering "(i)", "(ii)", and "(iii)" in the same order you state them in Line 264.
Thank you very much for your constructive comments and acknowledgment of our efforts. Your previous feedback has been immensely helpful in improving our work's quality. Following are our responses to your concerns.
- The algorithms' numerical performances are depicted starting at 500 iterations. It would be nice to show the performance of the algorithms in early iterations.
Following your valuable suggestions, we have added figures to include all iterations in Appendix E of the revised paper.
- Could you note the actual amount of time required to train the policy in Section 3.4?
Thank you for your helpful suggestion. We should have explicitly stated the training time for the policy to provide key details of the overall approach.
Specifically, we trained the agent for 1000 epochs across 200 CPU cores, which took approximately 24 hours. Per your advice, we have added this information to the training details in the revised paper.
- I would suggest ordering "(i)", "(ii)", and "(iii)", in the same order you state them in Line 264.
Following your suggestion, we have unified the order in Section 3.4 of the revised paper.
Thank you for your response, especially for including the figures and noting the training time. I confirm my score.
Dear reviewer, we extend our heartfelt gratitude for providing insightful suggestions. Your invaluable contributions have greatly enhanced the quality of our work, and we sincerely appreciate your thoughtful feedback.
This is a solid paper that proposes a new dynamic weighting scheme for CFR. I recommend the paper be highlighted at the conference.
Why not a higher score
My recommendation is for this paper to be selected for either spotlight or oral presentation.
Why not a lower score
I believe the paper deserves being highlighted.
Accept (spotlight)