PaperHub
Score: 6.3 / 10
Poster · 4 reviewers
Ratings: 4, 5, 9, 7 (min 4, max 9, std 1.9)
Average confidence: 3.8
COLM 2024

Suspicion Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4

Submitted: 2024-03-22 · Updated: 2024-08-26
TL;DR

LLMs for imperfect information games

Keywords

Imperfect Information Games, Large Language Models, Theory of Mind

评审与讨论

Review (Rating: 4)

The paper proposes Suspicion-Agent, a GPT-4-powered method for playing imperfect information games. A core feature of Suspicion-Agent is that it leverages second-order theory of mind (ToM) for planning. Suspicion-Agent is evaluated on Leduc Hold’em, a smaller version of Limit Texas Hold’em, against four typical decision-making methods: NFSP, DQN, DMC, and CFR+. Experimental results show that Suspicion-Agent outperforms all baselines except CFR+.

Reasons to Accept

  1. The problem of playing imperfect information games with LLMs is interesting and of great value.
  2. The proposed Suspicion-Agent is effective compared with baselines.

Reasons to Reject

  1. It is stated in the abstract that Suspicion-Agent "demonstrates remarkable adaptability across a range of imperfect information card games" and that "the capabilities of Suspicion-Agent across three different imperfect information games" are qualitatively showcased. However, I can only find results on the Leduc Hold’em game.
  2. Important related works [1,2] are not discussed. Theory of mind is also leveraged in [1], and Avalon is also an imperfect information game. [1] has to be discussed or even used as a baseline.
  3. I think at least the core part of Related Work has to be moved to the main content. Placing the entire Related Work section in the appendix is not acceptable. Note that reviewers are not required to read the appendix (https://colmweb.org/cfp.html#:~:text=Authors%20may%20use%20as%20many%20pages%20of%20appendices%20(after%20the%20bibliography)%20as%20they%20wish%2C%20but%20reviewers%20are%20not%20required%20to%20read%20the%20appendix).
  4. None of the existing works on playing imperfect information games with LLMs is mentioned in the introduction.
  5. Compared with existing work, the method proposed in this work is somewhat incremental.

[1] Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation.

[2] Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game.

Author Response

Thanks for the comments. We provide the following responses to address your concerns about the paper's writing, and we respectfully hope you will take them into account in your final decision.

1. However, I only find the results on the Leduc Hold’em game.

A: We will make clear in the final version that we quantitatively evaluate our method on the Leduc Hold’em game and qualitatively observe its adaptability across three imperfect information games.

2. Avalon is also an imperfect information game.

A: Thanks for the recommendation. The recommended papers are contemporaneous work and were unpublished at submission time, so we did not include them in our original submission, but we are happy to include a detailed discussion of [1,2] in our camera-ready paper.

3. Related Work has to be moved to the main content.

A: We will include it in the main content of our camera-ready version, using the 1-page extension.

4. None of works on playing imperfect information games with LLMs is mentioned in introduction.

A: At the time of our original submission, our work was indeed the first, and [1,2] are contemporaneous works. We are willing to remove the claim in our camera-ready version if that makes the paper more solid.

5. Compared with existing work, the method in this work is incremental.

A: As pioneering work designing a prompting system that enables LLMs to play poker games, we still believe our proposed method and the preliminary public experimental study and data are valuable to the community.

a. We experimentally demonstrate that even with its pre-training data, an LLM by itself cannot play various imperfect information games well, as Table 3 shows (-72 without ToM vs. +24 with second-order ToM). We then introduce a prompting system that enables large language models like GPT-4 to play various imperfect information games using only the game rules and observations, without any extra training; this may inspire more LLM-based subsequent work (a minimal sketch of such a loop appears after this list).

b. We comprehensively identify and discuss the current limitations of employing large language models in imperfect information games, contributing valuable insights for further AI-agent research.

c. We are preparing to make all our code and interactive data publicly available. We hope this will catalyze the development of more efficient models in the field of imperfect information games.
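For illustration, a minimal sketch of what such a ToM-aware prompting loop could look like. The prompt wording, the `plan_action` function, and the `chat` helper are expository assumptions, not the paper's actual prompt system:

```python
# Minimal sketch of a ToM-aware prompting loop for an imperfect
# information card game. Prompt wording, `plan_action`, and `chat`
# are illustrative assumptions, not the paper's actual prompt system.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def plan_action(rules: str, observation: str, history: str) -> str:
    # First-order ToM: what does the opponent likely hold/believe?
    opp_belief = chat(
        f"Game rules:\n{rules}\nAction history:\n{history}\n"
        "Based on the opponent's actions, estimate what cards they likely hold."
    )
    # Second-order ToM: what does the opponent believe about OUR hand,
    # given how our own actions look from their side?
    opp_belief_about_us = chat(
        f"Game rules:\n{rules}\nAction history:\n{history}\n"
        "Estimate what the opponent believes about our hand."
    )
    # Plan conditioned on both belief levels; no extra training needed.
    return chat(
        f"Game rules:\n{rules}\nCurrent observation:\n{observation}\n"
        f"Opponent likely holds: {opp_belief}\n"
        f"Opponent believes we hold: {opp_belief_about_us}\n"
        "Choose one legal action (fold/call/raise) and output only that word."
    )
```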

[1] Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation.

[2] Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game.

Comment

Thank you to the authors for their rebuttal. My main comments are as follows:

  1. From the perspective of when the paper was first submitted to arXiv, the insights provided by this paper were valuable and significant to the field. If I were evaluating the paper at that time, I would have a relatively positive opinion.
  2. From the current perspective, many papers have already addressed the same or similar issues as this paper. Compared to these papers, the innovation and contribution of the current version of this paper have been significantly weakened. Without considering the historical context of this paper, I would hold a negative evaluation.

If the chairs make a decision based on the first point above, I will not oppose it. If the chairs evaluate based on the second point, I also find that reasonable. Therefore, I will leave it to the chairs to decide whether to judge based on historical contribution or the current objective status.

Additionally, I suggest improving the writing of the paper, such as addressing issues in the related work section.

Review (Rating: 5)

This manuscript provides an agent for imperfect information games based on theory of mind. The authors demonstrate the effectiveness of the proposed method on Leduc Hold’em and provide extensive analysis of the proposed method and its interaction with a wide range of baseline methods.

Reasons to Accept

  • The manuscript is easy to follow.
  • The analysis is complete.

Reasons to Reject

  • It’s unclear how the baseline policies are obtained. Typically the strength of a policy heavily depends on the number and diversity of samples collected during training (even for CFR+), and it is unclear from the current manuscript whether the baselines have reached their limits.
  • It seems the authors have some misunderstanding of game theory. There is little point in computing the average in Table 2, as the agent can suffer significant losses against other types of players or in other game instances. We can only argue that one agent beats another in certain game instances, or when game instances are sampled from a certain distribution; similarly for Table 4. Win rate is also not a meaningful criterion. What we really do in game theory is test the exploitability of the policy, which should be simple for a game as small as Leduc Hold’em (see the sketch after this list).
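For a game of Leduc's size, exploitability can indeed be computed exactly with off-the-shelf tools. A sketch using OpenSpiel follows (OpenSpiel is assumed here for illustration; the paper's experiments use RLCard):

```python
# Sketch: exact exploitability of Leduc poker policies via OpenSpiel
# (pip install open_spiel). OpenSpiel is assumed for illustration;
# the paper's experiments use RLCard.
import pyspiel
from open_spiel.python import policy
from open_spiel.python.algorithms import cfr, exploitability

game = pyspiel.load_game("leduc_poker")

# Exploitability of the uniform-random policy, as a weak reference point.
uniform = policy.TabularPolicy(game)  # defaults to uniform over actions
print("uniform policy:", exploitability.exploitability(game, uniform))

# Exploitability of a CFR average policy shrinks with more iterations.
solver = cfr.CFRSolver(game)
for _ in range(200):
    solver.evaluate_and_update_policy()
print("CFR, 200 iters:",
      exploitability.exploitability(game, solver.average_policy()))
```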

Questions for Authors

  • Can you provide detailed information on the baseline training?
  • Can you further test the exploitability of each baseline method?
Author Response

Thank you for your time and effort in sharing critical feedback regarding our work. We provide the following response to address your concerns about the limitations of the results, and we respectfully hope you will take it into account in your final decision.

1. Can you provide detailed information on the baseline training?

A: Thanks. We trained our baselines using the default settings of RLCard (https://github.com/datamllab/rlcard/) and continued training until convergence (checked one by one). For example, we trained CFR+ (used in our experiments) for 2,000,000 epochs and observed convergence at around 1,000,000 epochs. Similarly, we trained DMC for 1,000,000 episodes, noting convergence at approximately 600,000 episodes, and DQN for 100,000 episodes. Therefore, the baselines used in our experiments have already reached their performance limits.
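For reference, a sketch of this kind of training loop, modeled on RLCard's bundled CFR example. RLCard ships vanilla CFR, so treating CFR+ as following the same loop, and the exact arguments, are assumptions that may vary across RLCard versions:

```python
# Sketch of baseline training with RLCard defaults (pip install rlcard).
# RLCard bundles vanilla CFR; the paper's CFR+ is assumed to follow the
# same loop, and exact arguments may differ across RLCard versions.
import rlcard
from rlcard.agents import CFRAgent

# CFR-style agents need allow_step_back to traverse the game tree.
env = rlcard.make('leduc-holdem', config={'allow_step_back': True})
agent = CFRAgent(env, model_path='./cfr_model')

for it in range(2_000_000):      # authors report convergence near 1M
    agent.train()                # one CFR iteration over the tree
    if (it + 1) % 100_000 == 0:
        agent.save()             # checkpoint to check convergence
```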

2. Can you test the exploitability of each method?

Thank you very much for your time and effort.

First, we agree that randomness and complexity play a significant role in evaluating imperfect information games, and some methods may perform well in some instances but poorly in others. Because of limited time and the expense of calling the GPT API, we did not run an exploitability evaluation, but we designed two types of experiments to compare Suspicion-Agent with the baselines while removing game-instance randomness:

1. Reducing the randomness of random seeds.

2. Reducing the randomness of card strength/position (an idea also used in AIVAT for imperfect information game evaluation [1]).

Given these two designs, we believe we can alleviate the effect of randomness in imperfect information games. Based on the results in Tables 1 and 2 of the main paper, Suspicion-Agent achieves superior performance over all learning-based baselines except CFR+ across all 100 random games: 50 games in position 0 and 50 games in position 1. This is clear evidence of the advantages of Suspicion-Agent.
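A sketch of this mirrored, seed-fixed evaluation scheme (the RLCard-style API is assumed, and `agent_a` / `agent_b` are placeholders):

```python
# Sketch of the variance-reduction scheme: each seeded deal is replayed
# with positions swapped so that card strength/position luck cancels.
# RLCard-style API assumed; `agent_a` / `agent_b` are placeholders.
import rlcard

def mirrored_eval(agent_a, agent_b, num_deals=50):
    total_a = 0.0
    for seed in range(num_deals):
        for seats in ((agent_a, agent_b), (agent_b, agent_a)):
            # Same seed => same cards; only the seating changes.
            env = rlcard.make('leduc-holdem', config={'seed': seed})
            env.set_agents(list(seats))
            _, payoffs = env.run(is_training=False)
            # payoffs[i] is the chip gain of the agent seated at i.
            total_a += payoffs[0] if seats[0] is agent_a else payoffs[1]
    return total_a / (2 * num_deals)  # mean chips per game for agent_a
```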

Lastly, we would like to highlight that our study is the first to consider the interaction between two agents by introducing theory of mind using an LLM. We respectfully hope you will reconsider your decision in light of these contributions and the innovative aspects of our study.

[1] Burch, Neil, et al. "AIVAT: A new variance reduction technique for agent evaluation in imperfect information games." Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, No. 1, 2018.

Comment

Thanks for the clarification on the baseline methods. I definitely acknowledge the novelty of the proposed method, but I still think exploitability is the gold standard for two-player zero-sum games and would provide more information and diagnosis of the results. I believe more informative results could be given with exploitability (e.g., the effect of each component), which would further improve the overall quality.

Review (Rating: 9)

This paper applies an LLM to building a poker-playing agent. It leverages prompt engineering to bake the mechanism of theory of mind into the planning of each action. The authors conduct experiments on a simplified version of Texas Hold’em and show a promising signal for using LLMs' reasoning power in imperfect information games.

Reasons to Accept

This is a very solid paper, and applying LLMs with planning to poker is novel as far as I am concerned. The proposed method lays a good foundation for this line of research, and I imagine it can serve as a solid baseline. The experimental results are comprehensive, with different baseline methods. It is also interesting to see the ablation study, which improves understanding.

Reasons to Reject

I really like this paper, and I think it would be a good contribution to the conference.

Questions for Authors

  1. Can you try more open-sourced models, e.g., to see how scaling helps a model become a better poker player?
  2. You identify a couple of important abilities in the appendix section "Gap between GPT-3.5 and 4". One of them is long prompt handling. As we all know, there are many recent attempts at increasing context length. It would be interesting to benchmark LLMs' poker-playing performance against their long-context performance, e.g., on the lost-in-the-middle benchmark.
Author Response

Thank you for your time and effort in sharing critical feedback regarding our work. We appreciate your encouraging feedback.

1. Can you try more open-sourced models?

A: Thanks. We tried more open-sourced models:

                  DMC   DQN   NFSP   CFR+
GPT-4              24    45    142    -22
Mistral v2 7B      23    33    -10    -66
Gemma              26     8     10    -72
LLama3-8B          54   -67    -61   -103
Phi3-medium-14B    82    57     36    -14

We indeed find that recent open-sourced models demonstrate strong instruction-following ability and can thus achieve better performance than before.

2. You identify a couple of important abilities in the appendix, "Gap between GPT-3.5 and 4". One of them is long prompt handling.

A: Thank you very much for the encouraging comment. As pioneering work applying LLMs to unseen tasks, especially imperfect information games, we believe our results, exploration, and public dataset will inspire subsequent work.

Review (Rating: 7)

This paper investigates LLM performance in imperfect information games without additional training or data, and proposes Suspicion-Agent. To tackle the challenge of opponent uncertainty, the authors propose a theory of mind (ToM) aware planning approach. The authors conduct comprehensive evaluations with GPT-4.

Reasons to Accept

  • The angle of imperfect information games for LLM evaluation is important and practical for various applications.
  • The proposed approach has certain novelty. The empirical experiments are comprehensive and lead to interesting observations.

Reasons to Reject

  • The paper only explores first-order and second-order prompts for planning. In practice, applications or games could require many more rounds of planning. Some discussion of the tradeoffs of n-order planning (n≥2) would be needed.
  • The ToM planning strategy seems similar to existing works on LLM game playing, like GTBench [1]. The authors need to clarify the differences from similar works and make the novelty clear.

[1] GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

Questions for Authors

  • What kinds of imperfect information applications can the proposed framework generalize to?
Author Response

Thank you for your time and effort in sharing critical feedback. We appreciate your encouraging feedback.

1. Some discussion of the tradeoffs of n-order planning (n≥2) would be needed.

A: This is a good point. The necessity of constructing or utilizing third-order (or higher) theory of mind (ToM) in game-playing scenarios has been discussed in previous research [1,2]. These studies suggest that ToM beyond the second order tends to become less effective, primarily because of the increased complexity of computing third-order (or higher) ToM. That complexity can lead to diminishing returns in strategic advantage, especially in practical gaming scenarios: players or AI agents attempting to anticipate an opponent's beliefs about another player's beliefs (the essence of third-order ToM) face considerable challenges in accurately processing and utilizing this information. Thus, while higher-order ToM may provide theoretical insight into more complex strategic thinking, its practical application in game playing is limited by computational and cognitive constraints.
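To make this blow-up concrete, an illustrative recursion follows (reusing the hypothetical `chat` helper from the earlier sketch; the function and prompt wording are assumptions): each additional ToM order nests one more belief-inference call, so cost and compounded estimation error grow with n.

```python
# Illustrative only: n-th order ToM as recursive belief inference.
# Reuses the hypothetical `chat` helper from the earlier sketch; each
# extra order adds one more model call and compounds estimation error.
def nth_order_belief(n: int, history: str, viewpoint: str) -> str:
    if n == 0:
        return history  # order 0: raw observations, no mentalizing
    other = "opponent" if viewpoint == "self" else "self"
    # Order-n belief from `viewpoint` builds on the other player's
    # order-(n-1) belief, so calls (and errors) stack with n.
    lower = nth_order_belief(n - 1, history, other)
    return chat(
        f"Taking the {viewpoint} perspective, given that the {other} "
        f"believes: '{lower}', infer what the {viewpoint} now believes."
    )

# Example: second_order = nth_order_belief(2, history, "self")
```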

2. The ToM planning strategy seems to be similar to existing works for LLM game playing.

A: Thank you for the recommendation. Our paper and GTBench are contemporaneous works, and we are happy to include the following discussion in our camera-ready paper:

The contemporaneous work GTBench proposes a benchmark to systematically evaluate the performance of LLMs on various imperfect information games using existing prompting techniques. In contrast, we have developed a prompt system specifically tailored for imperfect information games, which enables GPT-4 to outperform certain learning-based algorithms such as DMC and NFSP, even without specialized training or examples.

3. What kinds of imperfect information applications can the proposed framework generalize to?

A: Good point. Currently, Suspicion-Agent focuses on text-based imperfect information games, requiring the game rules and game-state interpretation rules as input. It therefore cannot yet generalize to multi-modal imperfect information games.

[1] De Weerd, H., et al. Higher-order theory of mind in negotiations under incomplete information. PRIMA 2013: Principles and Practice of Multi-Agent Systems.

[2] De Weerd, H., et al. How much does it help to know what she knows you know? An agent-based simulation study. Artificial Intelligence, 199, 67-92.

Final Decision

The paper considers LLM agents that play imperfect information games and their ability to display theory-of-mind capabilities. In addition to the novel ideas, the paper presents extensive experiments with the game of poker and shows favorable performance of the proposed method versus previously considered agents. I find the contribution novel and important: clear accept.