PaperHub
6.4 / 10
Poster · 4 reviewers
Ratings: 5, 5, 3, 3 (min 3, max 5, std 1.0)
Confidence: 3.5
Novelty: 2.5 · Quality: 2.8 · Clarity: 3.5 · Significance: 3.0
NeurIPS 2025

Multi-agent KTO: Enhancing Strategic Interactions of Large Language Model in Language Game

Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

We introduced Multi-agent KTO, a method that trains LLM to play Werewolf through direct gameplay. Our approach outperforms GPT-4o and RL+LLM methods, achieving human-competitive performance.

Abstract

Keywords
Large Language Model · Language Game · Application

Reviews and Discussion

Official Review
Rating: 5

This paper introduces MaKTO (Multi-agent Kahneman & Tversky’s Optimization), a framework for training LLMs via agent-play in Social Deduction Games (SDGs) without paired data. Inspired by Wittgenstein’s language-game theory, MaKTO first behavior-clones on a self-collected, large-scale Werewolf dataset of expert utterances and chain-of-thought annotations. It then engages diverse models to generate unpaired desirable and unacceptable gameplay samples, using KTO to sharpen strategic decisions. Extensive evaluations, including AI tournaments, human-AI matches, Turing-style detectability tests, generalization trials, and ablations, demonstrate MaKTO’s strength (an average 61% win rate in Werewolf games, outperforming GPT-4o and two-stage RL) and its human-level play.

Strengths and Weaknesses

Strengths

  1. Multi-agent training via diverse models is a clever adaptation of preference learning that avoids needing paired comparisons or hard labels. KTO leverages these unpaired samples to concurrently refine both decision policies and natural language outputs.
  2. The authors construct a rich expert-annotated dataset with chain-of-thought rationales. The complexity of the Werewolf game (deception, role-playing, and partial observability) provides a strong testbed for generalizable interactive intelligence.
  3. The paper validates MaKTO across automated AI tournaments, human-AI matches, Turing-style indistinguishability tests, transfer to novel game variants, and detailed ablations.
  4. Quantitative gains over GPT-4o and Claude-3.5-Sonnet are non-trivial.
  5. Win-rate against expert players and Turing-test-style indistinguishability trials confirm MaKTO’s behavior is both strategically sound and convincingly human-like.

Weaknesses

  1. Experiments are confined to a single game (Werewolf), raising questions about generalization to other strategic or conversational settings.
  2. The multi-agent play and KTO refinement loop may entail significant computational cost and large sample requirements; these overheads are not fully characterized.
  3. The choice of which agents participate in multi-agent play (GPT-4o, two-stage RL, etc.) can heavily influence the quality of "desirable" vs. "unacceptable" samples. The paper doesn’t study how performance varies if you swap in newer LLMs (Gemini-2.5-Pro, o4-mini, o3-mini, DeepSeek-R1, Qwen3, etc.) or change the pool composition.

Questions

  • How does MaKTO’s performance vary if you swap in newer LLMs (e.g. Gemini-2.5-Pro, o4-mini, o3-mini) or remove GPT-4o from the agent pool? Is there a clear relationship between the diversity/strength of the pool and final outcomes?

  • Your codebase appears to support both Chinese and English prompts. Could you clarify whether the behavior cloning, multi-agent play, KTO refinement, and evaluations were conducted in Chinese-only, English-only, or bilingually? Does the choice of language impact MaKTO’s strategy learning, win rates, and detectability across models with different language proficiencies?

  • Is it possible to report the performance results for (a) behavior cloning only, (b) KTO-only fine-tuning from random initialization, and (c) the full two-stage pipeline to quantify each stage’s individual contribution to the final win rate?

Limitations

Yes, the authors have adequately addressed the limitations.

Final Rating Justification

The authors added structured ablation studies and clear explanations regarding generalization, training stages, and agent pool composition, which enhance the clarity and completeness of the paper.

Formatting Issues

The paper is well-formatted.

Author Response

Thank you for your valuable feedback and insightful comments on our paper. We truly appreciate the time and effort you dedicated to reviewing our work. We address your comments in detail as below.


Q1: Generalization beyond the Werewolf game.

A1: We appreciate this insightful comment on the generalization of our approach. In this work, we demonstrate that through SFT and Multi-agent KTO, LLMs can effectively learn the complex strategies required for the Werewolf game. We believe that social deduction games share significant similarities in their underlying strategic spaces. As mentioned in Appendix J, we hypothesize that with minor fine-tuning, our LLM-based agent can effectively generalize to other games, such as Avalon. We aim to validate this in future work.

Additionally, we plan to include experimental results from more complex game settings, such as 10-player games with Guard/Hunter and 12-player games, to demonstrate the generalization capability of our approach across different player numbers. The detailed results can also be found in our response to Reviewer 3tA5's Q&A-1.


Q2: On computational cost: the multi-agent play and KTO refinement loop may entail significant computational cost and large sample requirements. These overheads are not fully characterized.

A2: Thank you for raising this important detail. Taking the largest scale training (MaKTO-72B model) as an example: the training involved 289 9-player Werewolf games. To ensure effective sampling, each game included at least three SFT-72B models. The remaining participants in these games included GPT-4o-mini (180 appearances), GPT-4o (180 appearances), Claude-3.5 (180 appearances), Mix model (162 appearances), and Qwen2.5-SFT-14B (296 appearances). This process yielded 20,674 MaKTO training samples, comprising 12,225 desirable data samples (59.1%) and 8,449 unacceptable data samples (40.9%). Overall, the API calls were affordable.
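For quick verification, the desirable/unacceptable split reported above can be reproduced from the raw counts. This is a minimal check added for the reader, not part of the original rebuttal:

```python
# Sanity-check the MaKTO-72B training sample split reported above.
desirable, unacceptable = 12_225, 8_449
total = desirable + unacceptable
assert total == 20_674
print(f"desirable: {desirable / total:.1%}, unacceptable: {unacceptable / total:.1%}")
# -> desirable: 59.1%, unacceptable: 40.9%
```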


Q3: Impact of Agent Pool Diversity and Strength: how MaKTO's performance varies if newer LLMs are swapped into the pool or if GPT-4o is removed, and whether there's a relationship between pool diversity/strength and final outcomes.

A3: This is an excellent suggestion for an ablation study. We conducted an experiment where we removed GPT-4o from the agent pool and trained a MaKTO-14B-w/o GPT4o model for comparison. The head-to-head win rates are shown below:

|                     | GPT-4o | SFT-14b | MaKTO-14b | MaKTO-14b-w/o GPT4o | Avg.  |
|---------------------|--------|---------|-----------|---------------------|-------|
| GPT-4o              | 0.50   | 0.44    | 0.37      | 0.41                | 0.430 |
| SFT-14b             | 0.56   | 0.50    | 0.48      | 0.51                | 0.513 |
| MaKTO-14b           | 0.63   | 0.52    | 0.50      | 0.51                | 0.540 |
| MaKTO-14b-w/o GPT4o | 0.59   | 0.49    | 0.49      | 0.50                | 0.518 |

From these results, we observe that MaKTO-14B, trained with GPT-4o in its agent pool, achieves a higher average win rate (0.540) compared to MaKTO-14B-w/o GPT4o (0.518). This suggests that the presence of stronger, more diverse agents like GPT-4o in the training pool contributes to learning more robust and effective strategies, leading to better overall performance for MaKTO. We also hypothesize that the diversity and strength of the agent pool have a positive correlation with the final outcomes of the MaKTO model. Your suggested ablation study is very interesting; our current experiment of simply removing GPT-4o is preliminary. We plan to conduct more comprehensive experiments, such as controlling the number of agents in the pool and comparing performance across different model sizes.


Q4: How language choice impacts strategy learning, win rates, and detectability.

A4: All behavior cloning, multi-agent play, KTO refinement, and evaluations were conducted in Chinese only. While the experiments were in Chinese, our human evaluators did not report any issues with GPT-4o's fluency in Chinese. The detectability of GPT-4o in the Turing-style test was primarily due to its behavioral patterns and voting strategies, rather than linguistic proficiency. As described in Line 210, GPT-4o's speaking style was often overly conservative, hesitant to make bold accusations even when the situation seemed clear to expert human players. Its voting behavior also differed, frequently leading to abstentions or votes for non-critical roles, unlike human players who would typically take a decisive stance in crucial moments (e.g., when two players claim to be the Seer). These strategic and behavioral differences were the main factors influencing its detectability. Detailed behavior of GPT-4o can also be found in the response to Reviewer uAsG Q&A-3.


Q5: Performance ablation of different training stages.

A5: Thank you for this excellent suggestion to quantify the contribution of each stage. We can provide a clear breakdown of the performance gains from our Behavior Cloning (SFT) stage and the subsequent MaKTO (SFT + KTO refinement) stage, compared to the base instruction-tuned model.

Here are the head-to-head win rates against various agents:

|                      | GPT-4o_mini | GPT-4o | Claude | SFT-14b | KTO-14b | Qwen2.5-Instruct-72b | SFT-72b | KTO-72b | Avg.  |
|----------------------|-------------|--------|--------|---------|---------|----------------------|---------|---------|-------|
| GPT-4o_mini          | 0.5  | 0.44 | 0.23 | 0.23 | 0.14 | 0.47 | 0.24 | 0.12 | 0.296 |
| GPT-4o               | 0.56 | 0.5  | 0.66 | 0.48 | 0.37 | 0.78 | 0.4  | 0.35 | 0.513 |
| Claude               | 0.77 | 0.34 | 0.5  | 0.48 | 0.48 | 0.95 | 0.44 | 0.38 | 0.543 |
| SFT-14b              | 0.77 | 0.52 | 0.52 | 0.5  | 0.48 | 0.79 | 0.57 | 0.49 | 0.580 |
| KTO-14b              | 0.86 | 0.63 | 0.52 | 0.52 | 0.5  | 0.89 | 0.51 | 0.42 | 0.606 |
| Qwen2.5-Instruct-72b | 0.53 | 0.22 | 0.05 | 0.21 | 0.11 | 0.5  | 0.2  | 0.1  | 0.240 |
| SFT-72b              | 0.76 | 0.6  | 0.56 | 0.43 | 0.49 | 0.8  | 0.5  | 0.42 | 0.570 |
| KTO-72b              | 0.88 | 0.65 | 0.62 | 0.51 | 0.58 | 0.9  | 0.58 | 0.5  | 0.653 |

From the table, we can clearly see the contributions:

  • Base Model (Qwen2.5-Instruct-72B): As a raw instruction-tuned model, it shows an average win rate of only 24%, indicating a significant gap when playing against API-based models and our fine-tuned 14B models.
  • Behavior Cloning (SFT-72B): Training with our collected expert data via SFT significantly boosts performance. The SFT-72B model achieves an average win rate of 57%, an absolute improvement of 33 percentage points over the base Qwen2.5-Instruct-72B model (0.570 - 0.240). This demonstrates the critical importance and quality of our collected expert dataset, enabling the SFT model to even outperform API-based models like GPT-4o and Claude in head-to-head matches.
  • MaKTO Refinement (MaKTO-72B): Building upon the strong SFT baseline, our MaKTO algorithm further refines the model's capabilities. The KTO-72B model achieves an average win rate of 65.3%, an absolute improvement of 8.3 percentage points over the SFT-72B model (0.653 - 0.570). This demonstrates that even on top of high-quality behavior-cloned data, our KTO-based refinement loop provides substantial additional gains, contributing to the overall superior performance. (A short numeric check of these averages and deltas follows this list.)
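The averages and stage-wise gains quoted above can be recomputed directly from the head-to-head table. The short script below is an illustrative check added for the reader (values transcribed from the table; this is not the authors' evaluation code):

```python
# Recompute row averages and stage-wise deltas from the head-to-head table above.
win_rates = {
    "Qwen2.5-Instruct-72b": [0.53, 0.22, 0.05, 0.21, 0.11, 0.50, 0.20, 0.10],
    "SFT-72b":              [0.76, 0.60, 0.56, 0.43, 0.49, 0.80, 0.50, 0.42],
    "KTO-72b":              [0.88, 0.65, 0.62, 0.51, 0.58, 0.90, 0.58, 0.50],
}
avg = {name: sum(r) / len(r) for name, r in win_rates.items()}
for name, a in avg.items():
    print(f"{name}: {a:.2f}")                                                     # ~0.24, ~0.57, ~0.65
print(f"SFT gain over base: {avg['SFT-72b'] - avg['Qwen2.5-Instruct-72b']:.2f}")  # ~0.33
print(f"KTO gain over SFT:  {avg['KTO-72b'] - avg['SFT-72b']:.2f}")               # ~0.08
```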
Comment

Thank you for the authors' responses. I appreciate the structured ablation studies and the clear clarifications regarding generalization, training stages, and agent pool composition. These responses address most of my concerns and help improve the clarity and completeness of the paper.

My overall rating remains at 5, but I have raised the clarity score to reflect the improved explanation and presentation in the rebuttal. Please ensure you will include these results and observations in your paper.

Comment

Thank you so much for your prompt and positive response. We are happy to hear that our rebuttal has addressed your concerns, and especially grateful for your decision to raise the clarity score – this is encouraging for us. :)

Official Review
Rating: 5

This paper introduces Multi-agent KTO, a method designed to enhance the performance of LLMs in the Werewolf game. MaKTO leverages multi-agent interactions and stepwise preference optimization to improve the strategic reasoning and social interaction capabilities of LLMs. The approach involves fine-tuning the models with expert data, generating desirable and undesirable responses, and refining decision-making processes using the KTO algorithm.

Strengths and Weaknesses

Strengths:

  1. The paper proposes a method to enhance LLMs' performance in social deduction games and offers robust validation of its capabilities, including performance tests against baselines and human players, as well as detailed behavioral analysis.
  2. The authors create a Werewolf dataset that includes expert utterances and actions.
  3. Good writing, with a clear and logical structure that makes the paper easy to read.

Weaknesses:

  1. Generalization: The design of the "think-before-respond" dataset format and the stepwise preference data selection both incorporate a large amount of human prior knowledge, which limits the generalizability of the approach. It remains unclear whether its effectiveness can be directly transferred to other types of social deduction games or broader application scenarios.
  2. Limitations of the Training Method: MaKTO offers a simple and effective training paradigm. However, the selection methods in Table 10 could also apply to online MARL. Whether online learning can discover effective strategies different from those of experts might be a more interesting topic for Social Deduction Games.

Questions

  1. Is the performance of GPT-4o and its results in the Turing-style detectability test possibly attributable to the choice of language (Chinese)?
  2. Did the authors experiment with other open-source foundational LLMs? If so, to what extent did the capabilities of these foundational LLMs influence the performance of MaKTO?

Limitations

No comments

Final Rating Justification

The authors' response has addressed most of my concerns. This work provides comprehensive experimental validation, demonstrating that the MaKTO framework achieves performance comparable to that of high-level human players. Its dataset construction methodology and multi-agent setting offer valuable insights for training AI in social deduction games. I will raise my score accordingly.

Formatting Issues

No comments

Author Response

Thank you for your valuable comments on our paper. We truly appreciate the time and effort you dedicated to reviewing our work. We will address your specific comments and concerns below.


Q1: Generalization Limitations Due to Human Prior Knowledge. It is unclear whether it can be directly transferred to other types of SDGs.

A1: We appreciate this insightful comment on the generalization of our approach. The "think-before-respond" format in our dataset indeed embeds fundamental strategic insights for social deduction games. Our experiments demonstrate that by training on this dataset, the LLM effectively learns these core strategies and develops strong reasoning abilities specifically for Werewolf. We posit that social deduction games share substantial similarities in their strategic spaces. Therefore, we believe that with minor fine-tuning, our approach could effectively generalize to other SDGs, such as Avalon. We plan to validate this in future work.

Additionally, we plan to include experimental results from more complex game settings, such as 10-player games with Guard/Hunter and 12-player games, to demonstrate the generalization capability of our approach across different player numbers. The detailed results can also be found in our response to Reviewer 3tA5's Q&A-1.


Q2: Applicability of Selection Methods to Online MARL and Discovery of Novel Strategies: the selection methods in Table 10 could apply to online MARL and whether online learning could discover strategies different from human experts.

A2: We agree these are highly relevant points. First, directly optimizing LLMs with online Reinforcement Learning (RL) in a multi-agent, dynamic, and long-utterance conversational environment is non-trivial and computationally demanding. To our knowledge, no existing work has achieved direct LLM optimization in such complex multi-party dialogue settings.

Second, we did incorporate an online RL-based baseline: the Mix model [1], a Cicero-like model that combines an online RL policy model with an LLM as an expressor (i.e., RL+LLM). In the policy model of Mix, both utterances and actions are formatted into structured inputs (e.g., "Player X labels Player Y as Z identity"). Its reward function incorporates both the ultimate game outcome (win/loss) and round-based correct actions (e.g., correctly voting for a werewolf, or the Seer identifying a werewolf, similar to the heuristic-based rewards in our Table 10). It uses an online MARL training approach, similar to AlphaStar. The LLM then serves as an expressor to articulate the learned policies into natural language utterances. As shown in Table 1 and Figures 3-6, MaKTO consistently outperforms the Mix model on average.
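To make the reward design described above concrete, the sketch below illustrates the kind of shaped reward such a policy model could use. The event names and bonus weights are hypothetical placeholders for illustration; they are not values reported for the Mix model:

```python
def shaped_reward(game_won: bool, round_events: list[str]) -> float:
    """Illustrative shaped reward for a Mix-style policy model: a terminal
    win/loss signal plus round-based bonuses for 'correct' actions
    (e.g., voting out a werewolf, the Seer checking a werewolf)."""
    # Hypothetical per-event bonuses; the rebuttal describes these events only qualitatively.
    bonuses = {
        "voted_out_werewolf": 0.2,
        "seer_checked_werewolf": 0.2,
    }
    reward = 1.0 if game_won else -1.0                        # terminal game outcome
    reward += sum(bonuses.get(e, 0.0) for e in round_events)  # round-based shaping
    return reward

# Example: the villager team wins and the Seer found a werewolf along the way.
print(shaped_reward(True, ["seer_checked_werewolf", "voted_out_werewolf"]))  # -> 1.4
```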

Regarding the question --- "Whether online learning can discover effective strategies different from those of experts," we made an interesting observation in Figure 3 (briefly described in lines 169-171). The strategies learned through online learning do indeed differ from human expert strategies. While the Mix model achieved a strong head-to-head win rate as a Villager, its strategy as a Werewolf was often overly aggressive, leading to a significantly lower Werewolf win rate. For example, we observed a pattern where all three Werewolves would simultaneously claim to be the Seer in one round of day-time discussion. In such a case, if the true Seer gained the trust of most villagers, the identities of all three Werewolves would immediately be exposed.

Due to space constraints, our detailed description of the Mix model's training was brief (only Line 160 in the main text now). We will add a more detailed description of this Online RL + LLM algorithm in the Appendix. Directly applying MARL methods to optimize LLMs remains a challenging research topic that we are actively exploring, though it falls beyond the primary scope of this paper.


Q3: Whether GPT-4o's performance in the Turing-style detectability test is attributable to the choice of the Chinese language.

A3: While our experiments are conducted in Chinese, the performance of GPT-4o and its detectability in the Turing-style test are not due to linguistic fluency issues. Based on feedback from our human players, GPT-4o's utterances are fluent and natural in Chinese. The detectability stems more from its distinct **speaking style and voting behavior** compared to human players (as noted in Line 210).

Specifically, in terms of speaking style, GPT-4o tended to be overly cautious. Even in situations where, to an experienced human player, the identity of a Werewolf was quite obvious, GPT-4o would often refrain from boldly accusing a player, preferring to state the need for further observation before confirming a suspicion. Regarding voting behavior, GPT-4o often abstained from voting or voted for a non-critical role. For example, if two players simultaneously claimed to be the Seer (a critical situation where human players would typically vote for one of them to express their stance), GPT-4o might abstain due to uncertainty or vote for another player it suspected, rather than engaging directly with the crucial "Seer" claims. These behavioral differences, not language fluency, were the main factors in its detectability.


Q4: Experiments with other open-source LLMs and Influence of foundational model capabilities.

A4: Yes, in the early stages of our experiments, we performed SFT training on the Llama-3.1-8B/70B-Chinese-Chat models [2]. Here are the results:

|                  | GPT-4o | Mix  | Qwen2.5-14b-SFT | Llama3.1-8b-SFT | Qwen2.5-72b-SFT | Llama3.1-70b-SFT | avg.  |
|------------------|--------|------|-----------------|-----------------|-----------------|------------------|-------|
| GPT-4o           | 0.5  | 0.56 | 0.44 | 0.44 | 0.4  | 0.38 | 0.453 |
| Mix              | 0.44 | 0.5  | 0.58 | 0.67 | 0.45 | 0.59 | 0.538 |
| Qwen2.5-14b-SFT  | 0.56 | 0.42 | 0.5  | 0.62 | 0.57 | 0.48 | 0.525 |
| Llama3.1-8b-SFT  | 0.56 | 0.33 | 0.38 | 0.5  | 0.49 | 0.47 | 0.455 |
| Qwen2.5-72b-SFT  | 0.6  | 0.55 | 0.43 | 0.51 | 0.5  | 0.5  | 0.515 |
| Llama3.1-70b-SFT | 0.62 | 0.41 | 0.52 | 0.53 | 0.5  | 0.5  | 0.513 |

We observe that, with the same SFT data, Qwen performed slightly better than the Llama-Chinese-Chat models in the Chinese context. This observation played a key role in our decision to select Qwen as our base model.

Furthermore, we found that larger model sizes significantly influenced performance. Larger models inherently have stronger instruction-following capabilities and fewer hallucinations, which are crucial for complex game dynamics. For instance, if a Villager model hallucinates or provides incorrect contextual information during its speech, it's more likely to be perceived as a Werewolf, directly reducing the Villager team's win rate. The enhanced reliability and contextual accuracy of larger foundational models therefore directly contributed to the overall performance improvements we observed.


[1] Wu, Shuang, et al. "Enhance reasoning for large language models in the game werewolf." arXiv preprint arXiv:2402.02330 (2024).

[2] https://huggingface.co/shenzhi-wang/Llama3.1-8B-Chinese-Chat

Comment

Thank you for your thorough clarifications. Your additional explanations regarding the discovery of novel strategies and the selection of open-source LLMs are very helpful. I have also read the comments from other reviewers and the authors' responses. I will finalize my decision after further discussions with other reviewers and AC.

Comment

Thank you for your prompt reply. We are glad that our clarifications are helpful. We appreciate your time and consideration. :)

Comment

Dear Reviewer uAsG,

Apologies for the direct message, and thank you again for your valuable and detailed review of our paper.

We are writing to follow up on the rebuttal we submitted last week. In our response, we have included clarifications on the generalization, a comparison with the Online RL method, and an explanation of GPT-4o's behavior in the detectability test. We also provided experimental results of fine-tuning other open-source LLMs (Llama3.1). We hope our response has sufficiently addressed your concerns.

As the discussion period deadline is approaching, we would be very grateful for the opportunity to discuss any final questions you may have. We are available to provide any further clarification needed. Thank you!

Best regards,

Submission 11638 Authors

Official Review
Rating: 3

This paper proposes multi-agent KTO to train LLMs to play Werewolf. The authors provide a human dataset from gameplay for SFT, and then use KTO from multi-agent gameplay to train the LLM. The resulting agent achieves strong performance against commercial LLMs and existing methods, and performs comparably to humans.

Strengths and Weaknesses

Strengths

  1. Interesting topic: this paper studies an interesting topic, the social deduction game Werewolf. This game involves strategic reasoning and decision-making in multi-agent interaction and has become one of the most popular language-game testbeds.
  2. Good performance: this paper provides extensive experiments, including head-to-head tournaments, human-AI experiments, and Turing-style tests, to demonstrate the good performance of the proposed agent.

Weaknesses

  1. Minor novelty in algorithm: the main designs in the proposed method, MaKTO, have little innovation compared to existing methods. More specifically, the algorithm is introduced in section 2.3:
    1. Kahneman-Tversky Optimization (L113-123): the algorithm KTO itself is a well-established method, and this work simply uses this algorithm in multi-agent games without any novel design for multi-agent settings (at least in L113-123).
    2. Multi-agent Gameplay (L124-133): one key design in MaKTO is to use population play instead of self-play to introduce diverse policies. However, this is a consensus in the MARL community [1,2] and is common practice in work on LLM agents [3], which is also of little novelty.
    3. Stepwise Preference Data Selection (L134-149): another key design is the step reward, or in standard RL/MARL terms, the problem of credit assignment. The current method mainly addresses the credit assignment between steps for a single agent. In the multi-agent setting, another challenge is the credit assignment between multiple agents. Since the outcome of the game depends on the joint action of all agents, the reward for a single agent's action depends not only on its own policy, but also on the policies of other agents. Therefore, determining the reward without considering other agents' policies is suitable for single-agent tasks, but not for multi-agent tasks. Moreover, an action is not just acceptable or unacceptable (a binary reward like 0/1); it is also about how desirable the action is (a scalar reward in [0, 1]), which is also not addressed in the method.
    4. In summary, given the name Multi-agent KTO, I would expect novel algorithm design for KTO in multi-agent tasks beyond common practice in existing literature, instead of simply applying this well-established method in multi-agent settings.
  2. Ambiguous motivation (L16-22, Figure 1): this work is motivated by the insight from Wittgenstein’s Language Game Theory. However, these insights and theory are mentioned but not explained. The difference between the language theory in Fig. 1(a) and (c) is only explained in two sentences and is hard to fully understand. Also, the connection between the language theory and the framework feels somewhat forced and not well-explained. For example, why does "rules" in Fig 1(c) correspond to "intention" in Fig 1(d)?
  3. Comparison with well-known method DPO: although the authors argue in L119-120 that it is nontrivial to get "prompt-chosen-reject" paired data for DPO, it seems quite natural to apply DPO in the same framework: for each prompt, we can generate multiple responses, and apply the same stepwise preference data selection to determine the chosen and rejected response for DPO. It would be more convincing if the authors could provide experiments with explanations to validate that KTO is better than DPO in this case.
  4. Some related works: [4,5] also study Werewolf games and agents. [6] also directly fine-tunes LLMs for another strategic game, Diplomacy.

[1] Vinyals, Oriol, et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning." Nature 575.7782 (2019): 350-354.

[2] Czarnecki, Wojciech M., et al. "Real world games look like spinning tops." Advances in Neural Information Processing Systems 33 (2020): 17443-17454.

[3] Meta Fundamental AI Research Diplomacy Team (FAIR)†, et al. "Human-level play in the game of Diplomacy by combining language models with strategic reasoning." Science 378.6624 (2022): 1067-1074.

[4] Shibata, Hisaichi, Soichiro Miki, and Yuta Nakamura. "Playing the Werewolf game with artificial intelligence for language understanding." arXiv preprint arXiv:2302.10646 (2023).

[5] Xu, Zelai, et al. "Learning Strategic Language Agents in the Werewolf Game with Iterative Latent Space Policy Optimization." arXiv preprint arXiv:2502.04686 (2025).

[6] Xu, Kaixuan, et al. "DipLLM: Fine-tuning LLM for strategic decision-making in Diplomacy." arXiv preprint arXiv:2506.09655 (2025).

Questions

My questions are discussed in the weakness part in detail. To briefly summarize:

  1. The novelty in the proposed method MaKTO.
  2. Explanation for motivation.
  3. Comparison with DPO.

Another minor comment is that the title "Multi-agent KTO: Reinforcing Strategic Interactions ..." would make the readers expect a reinforcement learning-based method or algorithm, while the proposed method is clearly not. I think replacing "reinforcing" with other words like "enhancing" would eliminate any unnecessary misunderstandings.

Limitations

Yes.

Final Rating Justification

For Q1, I think the authors may not fully get my point. To be clearer, my question is: what is the algorithm novelty of MaKTO compared to KTO, beyond applying it to multi-agent games? I agree that MaKTO is an end-to-end method that directly optimizes LLMs, but KTO itself is also an end-to-end training method, which diminishes the authors' claimed core contribution. From KTO to MaKTO, one would expect novel algorithm designs to address the challenges in multi-agent settings. As I have discussed in my review, the main designs of the proposed methods discussed by the authors are: (1) applying KTO to multi-agent settings (2) diverse gameplay, which is common practice (3) stepwise selection or reward shaping in RL terms. These designs do not provide new insights on how to train LLMs in multi-agent settings beyond applying existing single-agent methods. This is my major concern.

For Q2, I think the new explanation makes the motivation much clearer and has addressed my previous concern.

For Q3, I understand that it's hard to apply DPO directly in long-horizon games. But MaKTO is also not directly optimizing the full-length trajectory. Instead, the method uses stepwise preference data selection to optimize in one turn. As I discussed in my review, we can apply the same stepwise preference data selection to determine the chosen and rejected responses for DPO. So I don't think the authors' arguments for exponential states in long-horizon games hold for my question in the review. And I'm expecting experiment results that only replace KTO with DPO in MaKTO.

Q4 and Q5 are fully addressed.

In summary, my main concern on minor novelty beyond applying KTO in multi-agent settings, and comparison with DPO is not fully addressed. And I'll keep my score.

Formatting Issues

The section titles are not in initial caps/lower case.

Author Response

Thank you for your valuable feedback and insightful comments on our paper. We truly appreciate the time and effort you dedicated to reviewing our work. Here are the responses to your concerns.


Q1: Minor novelty in algorithm.

A1: We appreciate your detailed assessment of our algorithmic novelty. We believe our core contribution and novelty lie in proposing an end-to-end training paradigm that unifies natural language communication and action decision-making within the language space, directly optimizing LLMs for complex social deduction games. Prior works on Werewolf, including those you cited [3, 4, 5], typically decouple these two aspects: using LLMs for communication (natural language action space) and a separate policy network for action decision-making (formalized action space). This decoupling often introduces a semantic gap between communication and action. In contrast, our approach significantly mitigates this gap by integrating both communication and action within a unified natural language space. To the best of our knowledge, our work is among the first to propose and implement such an end-to-end paradigm for this class of social deduction games.

Furthermore, we would like to clarify that we view MaKTO not as a simple application of KTO, but as an extension of KTO. While KTO was originally designed for preference alignment based on human feedback, we extend its principles to the much more complex domain of multi-agent strategic language games. MaKTO demonstrates how the core idea of KTO can be generalized beyond simple alignment tasks to reinforce strategic reasoning in interactive, language-based environments. We believe that this extended framework is a valuable contribution and holds promise for other multi-agent tasks.
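For readers less familiar with KTO, the objective that MaKTO builds on can be written as follows. This is a reference sketch in the notation of the original KTO paper (Ethayarajh et al., 2024), not a formula taken from this submission:

```latex
% Standard KTO objective (notation follows Ethayarajh et al., 2024).
% r_theta is the implied reward; z_0 is a per-batch KL reference point.
r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}, \qquad
z_0 = \mathrm{KL}\big(\pi_\theta(y' \mid x) \,\|\, \pi_{\mathrm{ref}}(y' \mid x)\big)

v(x, y) =
\begin{cases}
\lambda_D \, \sigma\big(\beta (r_\theta(x, y) - z_0)\big) & \text{if } y \text{ is desirable,} \\
\lambda_U \, \sigma\big(\beta (z_0 - r_\theta(x, y))\big) & \text{if } y \text{ is unacceptable,}
\end{cases}

\mathcal{L}_{\mathrm{KTO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
\mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\lambda_y - v(x, y)\big]
```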


Q2: Ambiguous motivation from Wittgenstein's Language Game Theory, specifically questioning the mapping between "rules" in Figure 1(c) and "intention" in Figure 1(d).

A2: We apologize if our initial explanation of the motivation was unclear. A crucial insight from Wittgenstein's Language Game Theory is the unity of language and behavior: language games aren't just about speaking; they encompass language interwoven with various actions. Language functions through two-way interaction between people and with the world. This directly informs MaKTO's design: we aim to unify the modeling and optimization of both behavior (in-game actions) and communication (utterances) and to train our model end-to-end by interacting with diverse agents to receive varied feedback. This holistic approach is a key differentiator from prior works [3, 4, 5].

In contrast, previous work, which often decouples intentions and language for modeling, aligns more with Wittgenstein's earlier "picture theory" (Fig. 1a) where language is merely a representation—in this game context, a representation of intention. Regarding Fig. 1, our intention is to convey that language, behavior, rules, intentions ... are interconnected components that should be optimized holistically from interactive data. There isn't a direct one-to-one mapping between "rules" in (c) and "intention" in (d). We will refine the introduction section to ensure our contribution, novelty, and motivation are clearer to readers.


Q3: Comparison with DPO.

A3: We understand the reviewer's interest in a comparison with DPO. While it might seem straightforward to construct "prompt-chosen-reject" pairs for DPO, doing so effectively for complex multi-agent, long-horizon games with sequential decision-making like Werewolf is non-trivial. For example, in a single "night action - discussion - vote" cycle in a 9-player game, there are typically 21 decision turns. Even sampling just two responses per turn would lead to $2^{21}$ possible preference expansions for that single cycle. A full game spans, on average, three such night-day cycles, resulting in a staggering $2^{63}$ potential outcomes to enumerate and construct preferences from, which is computationally infeasible. The impact of a single-turn preference of one single agent player can be easily diluted over a long game sequence, leading to inefficient sampling when creating DPO training data.

Despite this challenge, we followed your suggestion and conducted an experiment with a simplified DPO setup. While heuristic-based construction of DPO training data is simpler for atomic actions (e.g., "Villager voting for a werewolf" as chosen vs. "Villager voting for another villager" as rejected), it is much harder for speech in the natural-language space. Nevertheless, we attempted to construct some DPO training data for speech data. For staged vote-based speech (e.g., "a Werewolf speaking and being voted out"), we selected the utterance of a Werewolf who was not voted out later in the game as the "chosen" utterance. For verifier-based speech, we modified numbers in utterances so that they conflicted with observable facts in the gameplay, to create "rejected" samples.
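As a rough illustration of the heuristic pair construction described above, the sketch below encodes the two rules (staged vote-based speech and verifier-based number perturbation). The record fields and helper logic are hypothetical and do not reflect the authors' actual data pipeline:

```python
import random
import re

def perturb_numbers(utterance: str) -> str:
    """Verifier-based corruption: shift every number in an utterance so the claim
    conflicts with observable game facts. Purely illustrative."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + random.randint(1, 3)), utterance)

def build_dpo_pairs(speech_records: list[dict]) -> list[dict]:
    """Hypothetical (prompt, chosen, rejected) construction following the two heuristics
    described above. Each record is assumed to carry 'prompt', 'speech',
    'speaker_voted_out', and optionally 'surviving_werewolf_speech' from the same stage."""
    pairs = []
    for rec in speech_records:
        if rec.get("speaker_voted_out") and rec.get("surviving_werewolf_speech"):
            # Staged vote-based rule: a werewolf's speech that drew the vote -> rejected;
            # the speech of a werewolf who survived the vote -> chosen.
            pairs.append({"prompt": rec["prompt"],
                          "chosen": rec["surviving_werewolf_speech"],
                          "rejected": rec["speech"]})
        else:
            # Verifier-based rule: factually consistent speech -> chosen;
            # a number-perturbed copy that contradicts observable facts -> rejected.
            pairs.append({"prompt": rec["prompt"],
                          "chosen": rec["speech"],
                          "rejected": perturb_numbers(rec["speech"])})
    return pairs
```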

With this setup, we trained a DPO-14B model and compared its performance with GPT-4o, SFT, and MaKTO. The results are as follows:

|           | GPT-4o | SFT-14b | MaKTO-14b | DPO-14b | Avg.  |
|-----------|--------|---------|-----------|---------|-------|
| GPT-4o    | 0.5    | 0.44    | 0.37      | 0.48    | 0.448 |
| SFT-14b   | 0.56   | 0.5     | 0.48      | 0.5     | 0.510 |
| MaKTO-14b | 0.63   | 0.52    | 0.5       | 0.51    | 0.540 |
| DPO-14b   | 0.52   | 0.5     | 0.49      | 0.5     | 0.503 |

From the table, you can see that DPO's win rate is not as good as MaKTO's. This is because Werewolf demands highly nuanced language capabilities, and constructing effective "prompt-chosen-reject" samples for DPO in the complex language space is challenging.

Furthermore, DPO's training stability was inferior to KTO's. DPO frequently experienced training failures and tended to generate long, repetitive, or unparseable text. We specifically used DPO-ftx to achieve stable training. Given these challenges, we believe that KTO is a more suitable and robust approach for this complex domain.


Q4: Missing Related Works.

A4: Thank you for pointing out these relevant works. We will add them to our related work section. As [5] and [6] were published very recently, we will incorporate these valuable works into our revised paper and discuss their different approaches. For instance, we will clarify that [6] focuses on no-press Diplomacy, a setting with less complex communication, while [5] utilizes clustering on embeddings, which differs from our method of optimizing directly in the natural language space.


Q5: Title Clarification: suggesting replacing "Reinforcing" with "Enhancing" in the title to avoid confusion with reinforcement learning.

A5: We agree that "Reinforcing" might create an unintended association with traditional RL algorithms. To more accurately reflect our contribution and avoid any potential misunderstanding, we will revise the title to use "Enhancing".

Comment

Thank you for your response! Please see my discussion below.

For Q1, I think the authors may not fully get my point. To be clearer, my question is: what is the algorithm novelty of MaKTO compared to KTO, beyond applying it to multi-agent games? I agree that MaKTO is an end-to-end method that directly optimizes LLMs, but KTO itself is also an end-to-end training method, which diminishes the authors' claimed core contribution. From KTO to MaKTO, one would expect novel algorithm designs to address the challenges in multi-agent settings. As I have discussed in my review, the main designs of the proposed methods discussed by the authors are: (1) applying KTO to multi-agent settings (2) diverse gameplay, which is common practice (3) stepwise selection or reward shaping in RL terms. These designs do not provide new insights on how to train LLMs in multi-agent settings beyond applying existing single-agent methods. This is my major concern.

For Q2, I think the new explanation makes the motivation much clearer and has addressed my previous concern.

For Q3, I understand that it's hard to apply DPO directly in long-horizon games. But MaKTO is also not directly optimizing the full-length trajectory. Instead, the method uses stepwise preference data selection to optimize in one turn. As I discussed in my review, we can apply the same stepwise preference data selection to determine the chosen and rejected responses for DPO. So I don't think the authors' arguments for exponential states in long-horizon games hold for my question in the review. And I'm expecting experiment results that only replace KTO with DPO in MaKTO.

Q4 and Q5 are fully addressed.

In summary, my main concern on minor novelty beyond applying KTO in multi-agent settings, and comparison with DPO is not fully addressed. And I'll keep my score.

Comment

Dear Reviewer Je77,

Apologies for the direct message, and thank you again for your valuable and detailed review of our paper.

We are writing to follow up on the rebuttal we submitted last week. In our response, we have clarified the paper's novelty and motivation and added an experimental comparison with DPO. We hope our response has sufficiently addressed your concerns.

As the discussion period deadline is approaching, we would be very grateful for the opportunity to discuss any final questions you may have. If there are remaining issues, We are available to provide any further clarification needed. Thank you!

Best regards,

Submission 11638 Authors

Comment

Dear Reviewer Je77,

Thank you again for your time and valuable feedback on our paper. We are writing to gently follow up on the rebuttal we submitted several days ago. We wanted to ensure our responses were clear and to see if there are any remaining points that require further clarification. We would be very grateful for the opportunity to discuss any issues before the discussion period concludes.

Best,

Submission 11638 Authors

Comment

Thank you for your follow-up. We are pleased that our responses on motivation (Q2) and other points (Q4, Q5) have resolved your concerns.

Q1: We would like to elaborate on the methodological contribution of MaKTO below:

First, from KTO to MaKTO, our work makes two key findings: (1) training with a diverse pool of agents is crucial for robustness in a multi-agent environment, and (2) intermediate, stepwise rewards are effective. We understand your point and agree that these concepts have been partially observed or mentioned in other related works, which we also acknowledge in our paper (e.g., Line 311). However, our key contribution here is that we are the first to systematically and comprehensively demonstrate and validate these findings within the context of KTO for complex language games.

Second, we believe the disagreement is on the definition of methodological novelty. You are correct that the components of MaKTO (population-play, reward shaping, KTO) are established concepts, and our contribution is NOT in inventing these components. More importantly, we empirically validate that this specific combination of components is crucial for successfully training an interactive agent for social language games. As prominent works [1, 2] have highlighted, such games are a vital testbed for advancing an LLM's strategic and linguistic capabilities. Our work takes this recognized challenge and provides a simple yet effective implementation on the Werewolf testbed, demonstrating its practical significance. As an application-track paper, we believe our framework has significant value to the language game AI community (a contribution that also includes our annotated dataset).


On Q3: The Comparison with DPO.

Thank you for acknowledging the difficulty of optimizing DPO in long-horizon games. We would like to further clarify that even with a stepwise approach, it is non-trivial to construct (context, chosen, rejected) speech pairs for DPO.

To illustrate: in Table 10, a Seer's speech that results in receiving more than half of the villagers' votes is considered as "unacceptable". But how would we construct a DPO pair for this? E.g. for this game snippet:

  • After Night 1, Player 4 (a villager) is dead.
  • Player 1 (Werewolf): I'm the Seer. I checked Player 3; he's a villager. This helps us narrow down the options...
  • Player 2 (Seer): I am the real Seer! Player 1 is a fake. He is a werewolf. I checked Player 3, and he is also a werewolf. Believe me, we have found two werewolves. Let's vote out Player 3! ...
  • Player 3 (Werewolf): "Player 2 is the liar! He's a werewolf trying to frame me. I'm a simple villager. Player 1 must be the real Seer. How could two werewolves expose themselves so quickly?..."
  • ... (other players' speeches are omitted.)
  • Vote Outcome: The majority of players believe Player 1 is a seer and vote out Player 2 (the real Seer).

In this case, the real Seer's speech (Player 2), despite being factually correct according to the game state, is an unacceptable sample for KTO because it failed to persuade the team and led to a disastrous outcome. Now, for DPO, we have the rejected sample (Player 2's speech). But the critical question is: what is the corresponding chosen speech given the same context? There is no objectively "correct" or "winning" speech to construct. This ambiguity in creating a chosen sample for strategic and persuasive language makes a simple replacement of KTO with DPO problematic.

"Stepwise" data does not necessarily imply that effective preference pairs can be easily generated. Despite this challenge, our simplified DPO experiment in the rebuttal has shown that it performed worse than MaKTO and suffered from training instability. This aligns with findings in other works [3,4,5] that note DPO's drawbacks.


We hope this deeper dive has clarified our contribution. Our work provides a practical and robust framework for a challenging problem, and we respectfully ask that you might reconsider your evaluation in light of these more nuanced explanations. Thank you again for your review, which has pushed us to better articulate the foundations of our work.


[1] Wen, Ying, Ziyu Wan, and Shao Zhang. "Language Games as the Pathway to Artificial Superhuman Intelligence." arXiv preprint arXiv:2501.18924 (2025).

[2] Schaul, Tom. "Boundless socratic learning with language games." NeurIPS Workshop on Language Gamification (2024).

[3] Azar, Mohammad Gheshlaghi, et al. "A general theoretical paradigm to understand learning from human preferences." International Conference on Artificial Intelligence and Statistics. PMLR, 2024.

[4] Pal, Arka, et al. "Smaug: Fixing failure modes of preference optimisation with dpo-positive." arXiv preprint arXiv:2402.13228 (2024).

[5] Saeidi, Amir, et al. "Insights into alignment: Evaluating dpo and its variants across multiple tasks." arXiv preprint arXiv:2404.14723 (2024).

Comment

Thank you for your responses. I will take them into consideration in forming my final evaluation.

Official Review
Rating: 3

The paper introduces MaKTO, a framework that trains LLMs to play social deduction games like Werewolf through multi-agent interaction and preference optimization. It outperforms GPT-4o by 23% in win rate and matches human-level performance while remaining indistinguishable in Turing tests.

Strengths and Weaknesses

Strengths:

  1. The paper extends KTO to a multi-agent setting and demonstrates strong performance in the Werewolf game.
  2. Several models are selected for comparison, and MaKTO shows a clear advantage.
  3. The authors construct an expert-level Werewolf dataset.

Weaknesses:

  1. Although the authors mention in the limitations that using online reinforcement learning is another promising direction, the absence of it as a baseline in the main experiments remains a significant issue. This is because, during the preference data selection process, the rule-based reward has essentially been defined, making online RL training feasible within the context of this paper.
  2. Simply replacing the Guard with the Hunter to evaluate generalization of MaKTO is insufficient.

Questions

  1. In the Werewolf game, the number of players has a significant impact on the game dynamics and outcomes. Could the authors evaluate MaKTO's performance under settings with a different number of players than in the training data, in order to better assess its generalization ability?
  2. In Appendix G, the authors attribute the poor performance of SFT w/ desirable to the reduced amount of data. Could the authors provide a detailed data comparison between the desirable and undesirable data?
  3. This paper only evaluates MaKTO on Qwen2.5-72b; is it also effective when applied to smaller models?
  4. Any comparable methods? (e.g. https://arxiv.org/pdf/2310.18940)

Limitations

Yes.

Final Rating Justification

I maintain that this work constitutes a straightforward application of KTO to datasets derived from multi-agent interactions, without introducing any substantive methodological innovations or theoretical adaptations for multi-agent scenarios.

Formatting Issues

N.A.

Author Response

Thank you for your valuable feedback and insightful comments on our paper. We truly appreciate the time and effort you dedicated to reviewing our work. In the following, we address your comments in detail.


Q1: Adding Online RL as a Baseline, noting that rule-based rewards make it feasible within the paper's context.

A1: We appreciate this suggestion. However, directly optimizing Large Language Models (LLMs) with online RL in complex multi-agent, long-utterance conversational environments is non-trivial and computationally intensive. To the best of our knowledge, there is no existing work that directly optimizes LLMs in such complex multi-party dialogue game scenarios. While we define "desirable actions", these are not strictly rule-based rewards that guarantee a win or loss. For instance, as shown in Table 10, a Seer identifying a Werewolf is a desirable action, which generally increases win probability. However, as illustrated in the case in Appendix H, there are cases where the Seer doesn't identify any Werewolf, yet the Villager team still wins through their deductions.

It is also important to note that we do include an "Online RL" baseline in our experiments: the Mix model [1], a Cicero [2]-like model that combines an online RL policy model with an LLM as an expressor. Its results are presented in Tab. 1 and Fig. 3-6. In the RL training for the policy model, we format utterances and actions into structured inputs (e.g., "Player X labels Player Y as Z identity") and employ a MARL approach as in AlphaStar [3] for policy modeling, using the LLM for speech generation. As mentioned in Figure 3 and lines 168-170, while the Mix model shows a good overall average win rate, it tends to expose the Werewolf team more easily. We believe the Mix model serves as a relevant Online RL baseline. Due to space constraints, the description of this baseline was concise in the main text, but we will expand on its detailed description in the revised paper.


Q2: Generalization Evaluation: Simply replacing the Guard with the Hunter is insufficient to evaluate generalization ability; the reviewer recommends evaluating under settings with a different number of players.

A2: This is an excellent suggestion, and we agree that broader generalization testing is crucial. We have conducted additional experiments with three new, unseen game settings, which were not present in either our behavior data or KTO training data:

  • 10-player game: Seer, Witch, Guard + 4 Simple Villagers + 3 Werewolves
  • 10-player game: Seer, Witch, Hunter + 4 Simple Villagers + 3 Werewolves
  • 12-player game: Seer, Witch, Hunter, Guard + 4 Simple Villagers + 4 Werewolves

Win rate in a 10-player game with Guard:

|           | GPT-4o | Claude | SFT-72b | MaKTO-72b | avg.  |
|-----------|--------|--------|---------|-----------|-------|
| GPT-4o    | 0.5    | 0.42   | 0.24    | 0.18      | 0.335 |
| Claude    | 0.58   | 0.5    | 0.42    | 0.36      | 0.465 |
| SFT-72b   | 0.76   | 0.58   | 0.5     | 0.46      | 0.575 |
| MaKTO-72b | 0.82   | 0.64   | 0.54    | 0.5       | 0.625 |

Win rate in a 10-player game with Hunter:

|           | GPT-4o | Claude | SFT-72b | MaKTO-72b | avg.   |
|-----------|--------|--------|---------|-----------|--------|
| GPT-4o    | 0.5    | 0.28   | 0.3     | 0.2       | 0.32   |
| Claude    | 0.72   | 0.5    | 0.55    | 0.52      | 0.5725 |
| SFT-72b   | 0.7    | 0.45   | 0.5     | 0.44      | 0.5225 |
| MaKTO-72b | 0.8    | 0.48   | 0.56    | 0.5       | 0.585  |

Win rate in a 12-player game:

|           | GPT-4o | Claude | SFT-72b | MaKTO-72b | avg.   |
|-----------|--------|--------|---------|-----------|--------|
| GPT-4o    | 0.5    | 0.4    | 0.48    | 0.31      | 0.4225 |
| Claude    | 0.6    | 0.5    | 0.52    | 0.33      | 0.4875 |
| SFT-72b   | 0.52   | 0.48   | 0.5     | 0.44      | 0.485  |
| MaKTO-72b | 0.69   | 0.67   | 0.56    | 0.5       | 0.605  |

Our experiments show that MaKTO consistently achieves a higher win rate than the SFT model across all three new settings without any additional training. The performance improvement stems from MaKTO's ability to enhance the model's underlying reasoning capabilities and strategic skills through interactive feedback.


Q3: Detailed Data Comparison between desirable and undesirable training data.

A3: Here are the detailed stats of the expert-annotated data:

| # Desirable  | # Unacceptable | Total |
|--------------|----------------|-------|
| 7362 (60.7%) | 4763 (39.3%)   | 12125 |

The purpose of the "SFT w/ desirable" experiment was to investigate whether filtering out "unacceptable" responses would lead to better performance compared to training SFT on the full dataset. Our findings indicated that, on average, training only on desirable actions actually resulted in worse performance than training on the full SFT dataset and then performing MaKTO training.


Q4: Whether MaKTO is also effective when applied to smaller models, beyond Qwen2.5-72b.

A4: Yes, MaKTO's effectiveness is demonstrated on smaller models as well. In our paper, we provide performance evaluations for the 14B model size in Table 1, Figures 3-4, and Tables 4-8. Specifically, our ablation study in Section 3.6 confirms that the MaKTO algorithm is effective and performs well at the 14B model scale.


Q5: Comparable methods, e.g., [4].

A5: We appreciate you bringing up this work. The paper you cited is indeed relevant and has been mentioned in our introduction and related work sections. Unfortunately, as the authors have not open-sourced their code, we were unable to reproduce their results or conduct a direct comparison. However, as discussed in A1, we have compared our approach against another important baseline: the Mix model [1]. This is a Cicero-like model that integrates RL with an LLM.


[1] Wu, Shuang, et al. "Enhance reasoning for large language models in the game werewolf." arXiv preprint arXiv:2402.02330 (2024).

[2] Meta Fundamental AI Research Diplomacy Team (FAIR)†, et al. "Human-level play in the game of Diplomacy by combining language models with strategic reasoning." Science 378.6624 (2022): 1067-1074.

[3] Vinyals, Oriol, et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning." Nature 575.7782 (2019): 350-354.

[4] Xu, Zelai, et al. "Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game." In Proc. of ICML2024.

Comment

Thanks for your reply.

While we appreciate the detailed explanations regarding generalization evaluation and data comparison, the fundamental issue remains in question. Although the MaKTO training algorithm demonstrates improved performance for werewolf agents, its underlying paradigm essentially constitutes a single-agent reinforcement learning framework with data collection capabilities, rather than a genuine multi-agent system. This approach lacks sufficient methodological novelty as it fails to address the core challenges of multi-agent learning. So I choose to keep the original score.

Comment

Dear Reviewer 3tA5,

Apologies for the direct message, and thank you again for your valuable and detailed review of our paper.

We are writing to follow up on the rebuttal we submitted last week. In our response, we have provided clarifications on the comparison with the Online RL method and other comparable methods. We have also included the experimental results on generalization, data statistics of desirable and undesirable training data, and results on smaller models. We hope our response has sufficiently addressed your concerns.

As the discussion period deadline is approaching, we would be very grateful for the opportunity to discuss any final questions you may have. We are available to provide any further clarification needed.

Best regards,

Submission 11638 Authors

Comment

Dear Reviewer 3tA5,

Thank you again for your time and valuable feedback on our paper. We are writing to gently follow up on the rebuttal we submitted several days ago. We wanted to ensure our responses were clear and to see if there are any remaining points that require further clarification. We would be very grateful for the opportunity to discuss any issues before the discussion period concludes.

Best,

Submission 11638 Authors

Comment

Dear Reviewer 3tA5,

Thank you for your follow-up. We are very pleased to hear that our detailed explanations regarding the generalization evaluation and data comparison have successfully addressed your concerns. We appreciate you taking the time for this important discussion on what appears to be the remaining fundamental issue: the novelty, i.e., the core contribution.

While we agree that advancing multi-agent RL is a promising research direction, NOT every paper on agent learning must have this as its primary objective. We respectfully ask that our work be evaluated on its stated contributions: the proposal of a novel unified training paradigm for language game agents and the strong empirical results that validate its effectiveness. We believe a rejection based on our paper not solving a different set of research problems CANNOT be fully justified.

With this context in mind, here are our specific clarifications on the paper's novelty and methodology:

  1. On Methodological Novelty and Scope: Our paper's primary goal is not to solve the foundational challenges of multi-agent training, and we have never claimed to propose a "genuine multi-agent system" in the traditional MARL sense in the paper. Instead, our core contribution focuses on a different but crucial problem: how to effectively train a single, unified policy that can jointly handle both language communication and strategic decision-making in complex social deduction games. This is a significant difference from existing literature [1,2,3], which typically trains communication and decision-making policies in two separate stages. As our experiments show, our unified approach demonstrates superior performance.

  2. The MaKTO framework is a deliberate methodological choice to address this specific challenge efficiently. As we noted (Lines 114-118), the huge action spaces in multi-agent dialogues lead to sparse trajectory sampling, making online RL algorithms difficult to train. We were encouraged that your initial review recognized this contribution, describing our work as *an extension of KTO to a "multi-agent setting" with strong performance* (Strength 1).

  3. On the "Multi-Agent" Paradigm: As for your mention that "its underlying paradigm essentially constitutes a single-agent reinforcement learning framework with data collection capabilities". Regarding the paradigm being a "single-agent framework," we would like to offer a different perspective. The key distinction between multi-agent and single-agent approaches often lies in whether the data collection process involves dynamic interactions among multiple agents or not. Our framework's data is generated precisely from such multi-agent interactions. This paradigm — training a centralized policy on data from multi-agent gameplay — is a common and highly successful training approach used in game AI research, including OpenAI Five[4] and AlphaStar [5]. Therefore, our method firmly presents a multi-agent approach.

We hope the above explanation can successfully clarify the core contribution of our work.

We again respectfully ask that our work be evaluated based on its stated contributions and its demonstrated success in this domain, rather than for not addressing a different set of research problems. Given these clarifications, we would be sincerely grateful if you would be open to reconsidering your evaluation. Thank you again for your time and for this important discussion.


[1] Meta Fundamental AI Research Diplomacy Team (FAIR)†, et al. "Human-level play in the game of Diplomacy by combining language models with strategic reasoning." Science 378.6624 (2022): 1067-1074.

[2] Shibata, Hisaichi, Soichiro Miki, and Yuta Nakamura. "Playing the Werewolf game with artificial intelligence for language understanding." arXiv preprint arXiv:2302.10646 (2023).

[3] Xu, Zelai, et al. "Learning Strategic Language Agents in the Werewolf Game with Iterative Latent Space Policy Optimization." arXiv preprint arXiv:2502.04686 (2025).

[4] Berner, Christopher, et al. "Dota 2 with large scale deep reinforcement learning." arXiv preprint arXiv:1912.06680 (2019). https://arxiv.org/abs/1912.06680

[5] Vinyals, Oriol, et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning." Nature 575 (2019): 350-354. https://doi.org/10.1038/s41586-019-1724-z

Comment

After carefully considering the paper's motivation and stated contributions, I maintain that this work constitutes a straightforward application of KTO to datasets derived from multi-agent interactions, without introducing any substantive methodological innovations or theoretical adaptations for multi-agent scenarios.

Comment

Thank you for your reply. We regret that after multiple rounds of discussion, you still hold this perspective.

As we explained in our response to Reviewer Je77, our key contribution is that we are the first to systematically and comprehensively demonstrate and validate these findings in the context of KTO training for complex language games. Furthermore, as a submission to the application track, we believe our framework has significant value to the language game AI community (a contribution that also includes our annotated dataset).

Comment

Dear Reviewers,

We would like to once again express our sincere gratitude for your detailed and constructive feedback. Now that we are in the discussion period, we are eager to engage with you. To facilitate this, we have identified and summarized several common questions raised across the reviews. We hope this consolidated response helps to clarify them.


Q1: On adding Online RL as a comparison, and its potential for discovering novel strategies. (From Reviewers 3tA5, uAsG)

A1: First, we would like to clarify that directly optimizing an LLM with online RL in a complex, multi-agent, long-utterance SDG environment is a significant challenge due to its high computational cost. To our knowledge, no existing work has successfully done this for such a complex dialogue-based game. However, we did include a strong online RL baseline in our paper: the "Mix" model (Table 1 and Figures 3-6), whose approach is similar to Cicero and AlphaStar:

  1. Decoupled Policy and Expression: It uses a policy network trained with MARL for decision-making. The LLM then acts as an "expressor" that translates the policy's decision (e.g., "Player X votes for Player Y") into descriptive natural language (see the sketch at the end of this answer).
  2. Online Training: The policy network is trained online through self-play. Its reward function includes both the final game outcome (win/loss) and rule-based rewards for desirable actions within a round (e.g., the Seer correctly identifying a Werewolf), which are similar to the heuristics we use in Table 10.
  • Performance: The MaKTO model achieves a higher average win rate than the Mix model.
  • Discovery of Novel Strategies: An interesting question is whether online learning can discover strategies different from human experts. We found that it can, but these strategies are not always superior. For instance, we observed interesting behavior from the Mix model when playing as a Werewolf (as noted in Figure 3, lines 169-171). It learned an overly aggressive strategy, such as having all three Werewolves claim to be the Seer in one day-time discussion round. In such a case, if the real Seer managed to win the trust of the villagers, this strategy would lead to all three Werewolves being immediately exposed.

This indicates that while online RL can indeed explore novel strategies, our MaKTO framework, which learns from interactions with a diverse pool of agents, produces more robust and effective strategies for this complex game.

Also, we will add a more detailed description of the Mix model's training to the appendix in our revised version.
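For concreteness, the sketch below illustrates, under our own simplifying assumptions (not the actual Mix implementation), the decoupled "policy + expressor" design described in point 1: a small decision policy picks a discrete action, an LLM-style expressor verbalizes it, and the reward combines the final outcome with rule-based in-round bonuses. All function names and coefficients are hypothetical placeholders.

```python
# Illustrative sketch of the decoupled "policy + expressor" baseline (assumed
# structure, not the actual Mix code); names and coefficients are placeholders.
import random

ACTIONS = [f"vote_player_{i}" for i in range(1, 10)] + ["abstain"]

def policy_network(observation: dict) -> str:
    """Placeholder for the MARL-trained decision policy; a real system would
    run a trained network over the observation instead of choosing randomly."""
    return random.choice(ACTIONS)

def llm_expressor(action: str, observation: dict) -> str:
    """Placeholder for the LLM 'expressor' that turns the chosen discrete
    action into an in-game natural-language statement."""
    if action == "abstain":
        return "I will hold my vote for now; I need more information."
    target = action.split("_")[-1]
    return f"I vote for Player {target}; their story does not add up."

def round_reward(game_won: bool, seer_found_werewolf: bool) -> float:
    """Reward mixing the final game outcome with a rule-based in-round bonus
    (e.g., the Seer correctly identifying a Werewolf)."""
    return (1.0 if game_won else -1.0) + (0.2 if seer_found_werewolf else 0.0)

obs = {"phase": "day_vote", "alive": list(range(1, 10))}
action = policy_network(obs)
print(llm_expressor(action, obs))
print("reward:", round_reward(game_won=True, seer_found_werewolf=False))
```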


Q2: On the model's generalization ability to unseen game configurations and its potential to transfer to other social deduction games. (From Reviewers 3tA5, uAsG, ghkm)

A2: We have addressed this in two ways.

1. Generalization to Unseen Configurations within Werewolf: We added new experiments on three game settings that were entirely absent from our training data (both the SFT and KTO stages):

  • 10-player game with Guard: Seer, Witch, Guard + 4 Simple Villagers + 3 Werewolves
  • 10-player game with Hunter: Seer, Witch, Hunter + 4 Simple Villagers + 3 Werewolves
  • 12-player game: Seer, Witch, Hunter, Guard + 4 Simple Villagers + 4 Werewolves

Specific results can be found in our response to Reviewer 3tA5; they show that MaKTO enhances the model's underlying reasoning and strategic skills.

2. Potential for Generalization to Other SDGs:

We believe that different SDGs, such as Werewolf and Avalon, share significant overlap in their core strategic spaces (e.g., identity concealment, logical deduction). The "think-before-respond" data format we use (illustrated below) helps the model learn these fundamental strategic principles. We hypothesize that our trained agent could be effectively adapted to other SDGs with minor fine-tuning, a direction we are excited to explore in future work (as mentioned in Appendix J).
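For reference, here is a simplified, hypothetical example of the "think-before-respond" format; the field names and game content are illustrative placeholders, not the exact schema of our dataset.

```python
# Hypothetical "think-before-respond" training sample; field names and game
# content are simplified placeholders, not the exact schema of our dataset.
example = {
    "role": "Seer",
    "context": "Day 2 discussion. Player 5 claims to be the Seer. "
               "Last night I checked Player 7: Werewolf.",
    "think": "I am the real Seer, so Player 5's claim is false and Player 5 is "
             "likely a Werewolf counter-claiming to confuse the village. "
             "I should reveal my check on Player 7 and push the vote there.",
    "respond": "I am the real Seer. I checked Player 7 last night and they are "
               "a Werewolf. Please vote Player 7 out today.",
}
print(example["think"])
print(example["respond"])
```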


Q3: On the influence of the language used (Chinese) in the detectability test. (From Reviewers uAsG, ghkm)

A3: All experiments were conducted exclusively in Chinese. GPT-4o's high detectability was not due to poor language fluency. The key reasons for its detectability are its distinct strategic patterns, such as:

  • Overly Cautious: GPT-4o often hesitated to make bold accusations, even when a suspect's identity was fairly obvious to experienced human players.
  • Hesitant Voting: It frequently abstained from voting or voted for non-critical players, especially in crucial situations like a face-off between two players claiming to be the Seer, where a human player would typically take a stance.

We will add a more detailed description of these behaviors in the revised paper.


We hope this general response will be helpful. We have provided more detailed explanations in our individual rebuttals and are standing by to answer any further questions you may have.

Best regards,

Submission 11638 Authors

Final Decision

This paper investigates LLM behavior in the social deduction game Werewolf. The authors collect a new dataset of annotated Werewolf games played by human experts, capturing the players' utterances and actions during the game and chains-of-thought behind their decisions. They then investigate how to improve the performance of an LLM on the Werewolf game. The authors first finetune the LLM on the Werewolf dataset using SFT: this behavior cloning gives the model an understanding of the gameplay and terminology, and makes the LLM behavior more closely match that of human players. Next, the authors propose multi-agent Kahneman and Tversky's Optimization (multi-agent KTO or MaKTO), which is a variant of KTO that uses a pool of agents instead of self-play and uses a heuristic stepwise preference data selection mechanism instead of simple win-loss outcomes.

The main contributions of the paper are the new Werewolf dataset and the empirical analysis. The methodological contribution is limited, but this is not the main purpose of the paper.

Most of the paper is devoted to empirical evaluations: they evaluate their approach in the 9-player Seer-Witch-Guard Werewolf setting through tournaments, human-AI competitions, and a Turing-style detectability test.

They find that the proposed approach MaKTO-72b outperforms baseline LLMs and has the highest TrueSkill rating. Their model also ranked fourth among a mix of human and AI players in a human-AI competition. The authors also performed a Turing-style detectability test, in which human players were asked to determine whether each other participant is a human or an AI. They found that MaKTO was detected as AI 48.9% of the time, compared to baseline GPT-4o, which was detected 76.6% of the time. This is expected because GPT-4o was not finetuned on the Werewolf data.

In Table 1, the authors compare win rates of different models (GPT-4o mini, GPT-4o, Claude) and a few variants of their approach (SFT-14b, SFT-72b, and MaKTO-72b) in the Seer-Witch-Guard setting of Werewolf.

The reviews for this paper were mixed, with scores {3, 3, 5, 5}.

Reviewer 3tA5 raised concerns regarding the lack of an online RL baseline and insufficient experiments to evaluate the generalization of MaKTO. This reviewer also pointed out that the paper only evaluates MaKTO on Qwen2.5-72b but not smaller models. The authors responded that their Mix baseline is an online RL method, and they provided a generalization analysis for Werewolf games with 10 and 12 players. Reviewer 3tA5 was not fully convinced by the rebuttal, and maintained that a key issue was the lack of methodological novelty of Multi-agent KTO and the lack of comparison to Xu et al., “Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game” ICML 2024.

Reviewer Je77 raised concerns regarding the limited novelty, noting that most of the components of the proposed approach have been done in prior work, e.g., Kahneman-Tversky Optimization, using population play instead of self-play, and using stepwise preference data selection. This reviewer also raised concerns regarding the lack of comparison with DPO.

Reviewer uAsG raised concerns regarding missing comparison to online MARL, the generalizability of the method to different social deduction games, missing evaluations of other LLMs, and potential issues with the Turing-style detectability test. The authors responded that they used the Mix online RL baseline. They also provided another baseline using Llama-3.1-Chinese-Chat models, but they only showed SFT results, not MaKTO.

Reviewer ghkm raised concerns regarding the limited experiments only on a single game Werewolf, missing discussion of the computational cost of the multi-agent play and KTO loop, the choice of agent pool composition and LLM models, and missing ablations. The authors did not provide results on new games, but they did perform an ablation over the agent pool diversity.

The paper makes a good contribution in terms of the dataset it introduces and empirical analysis it performs. Reviewers disagreed about whether the paper requires a methodological contribution, and overall it seems that the paper has sufficient contribution even without this.