Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents
We build a simulation environment to test sustainability behavior in a society of LLMs, based on the economics theory of Governing the Commons.
Abstract
Reviews and Discussion
The paper designs a generative simulation platform for strategic interactions and cooperative decision-making in LLM agents. Three similar economics scenarios, in which all agents exploit a common-pool resource while trying to sustain it for the future, are tested. The authors design an LLM-based agent architecture and test it on open and closed LLMs. Extensive evaluations include benchmarking, norm robustness given a new greedy newcomer, the sustainability improvement from universalization reasoning, an ablation of communication, and an analysis of agent dialogues.
Strengths
- Well-written and easy to understand by a reader outside this field.
- Proposes a new multi-agent platform to test LLM agents on sustainable behavior in three simplified economic scenarios.
- Evaluations are comprehensive and insightful.
Weaknesses
Are these three scenarios too similar? They seem to be one task with different names. Can you test more settings with a shared common resource?
Questions
None
Limitations
Yes
Thank you for your positive feedback on our work, noting that it is "well-written, easy to understand," that it "propose[s] a new multi-agent platform," and that the "evaluations are comprehensive and insightful."
Below we address the one comment raised in your review:
Are these three scenarios too similar? They seem to be one task with different names. Can you test more settings with a shared common resource?
In our study, we need both similarity and variation between tasks. The shared underlying structure behind the three scenarios enables us to aggregate the analyses and show sustainability results across different instances, while the variations ensure robustness, so that our results are not tied to any particular set of prompts and situations. We therefore must balance diversity with interpretability. The three scenarios were inspired by prominent examples from the economics literature to illustrate the generalizability of the same underlying phenomena.
To describe the existing diversity more directly, we can think of two dimensions of variation in the simulations reported.
Dimension 1) Our three scenarios differ in their framing and the number of quantities the agents must reason about:
- In the Fishing scenario, agents only need to consider one variable: the number of fish.
- In the Sheep Grazing scenario, agents need to consider two variables: the number of sheep and available grass.
- The River Pollution scenario also requires two variables: widget production and its impact on pollution levels.
While still maintaining the same underlying causal model of resource extraction and regeneration, these variations lead to differences in LLM performance, as can be seen in Table 1. The sustainability/gain/efficiency scores differ across the three scenarios (although the overall trends remain systematic).
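For intuition, all three scenarios share the same harvest-then-regrow loop. Below is a minimal Python sketch of that shared structure; the regrowth rule and parameter values are illustrative assumptions rather than the exact dynamics used in GovSim.

```python
# Illustrative sketch of the shared extraction-regeneration loop behind all
# three scenarios. Parameters and the capped-doubling regrowth rule are
# assumptions for illustration, not the paper's exact settings.

def run_commons(harvest_policy, capacity=100, regrowth=2.0, n_agents=5, n_rounds=12):
    """Simulate a common-pool resource: agents harvest, then the pool regrows."""
    stock = capacity
    for month in range(1, n_rounds + 1):
        harvests = [harvest_policy(agent, stock, n_agents) for agent in range(n_agents)]
        stock -= sum(harvests)
        if stock <= 0:                                # resource collapse ends the run
            return month
        stock = min(capacity, stock * regrowth)       # capped regrowth
    return n_rounds                                   # survived all rounds

# A greedy policy collapses the pool immediately; a restrained, even policy sustains it.
greedy = lambda agent, stock, n: stock // n
sustainable = lambda agent, stock, n: (stock // 2) // n
print(run_commons(greedy), run_commons(sustainable))  # -> 1 12
```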
Dimension 2) Perturbations to the base simulation result in more complex dynamics that the LLM-based agents must face. These reflect more substantive variation in the causal dynamics.
- The “Newcomer” perturbation scenario (Section 3.3) checks whether LLM agents are robust against newcomers with a different behavior from the existing agents who have established a norm.
- The “Absence of Communication” scenario (Section 3.5) tests how cooperation decays if agents cannot communicate with each other in an open-ended fashion.
- The “Universalization Reasoning” scenario (Section 3.4) probes whether making the long-term group outcome more salient improves cooperative behavior.
We will update Section 5 (Future Work) to more explicitly describe additional ways in which the scenarios could be varied to more deeply study LLM cooperative decision making and make clear why we chose the three scenarios studied in this work.
Please do not hesitate to let us know if you have further questions.
Dear Reviewer,
Thank you again for kindly reviewing our paper. We have provided a detailed response to your question and highlighted the variety in our experiments. Should our response address your concerns, would you consider raising your score, perhaps to a clear acceptance?
Or if you have further questions, we would be more than happy to provide additional details and engage in the discussion! Thank you in advance for your kind consideration.
The paper introduces a generative simulation platform to investigate the dynamics of resource sharing among multiple large language model (LLM) agents. Specifically, the authors construct a common-pool resource problem in which the classic social science problem of the tragedy of the commons can be demonstrated. The authors show that most LLM agents fail to achieve a sustainable equilibrium. The paper also highlights that "Universalization"-based reasoning significantly enhances sustainability.
Strengths
- The introduction of GovSim provides a novel platform to study the cooperative behavior of LLM agents in resource-sharing scenarios. The design bridges the development of LLM-agents with classic social science theories, offering a unique perspective into developing rational LLMs.
- The paper offers thorough analyses and a variety of experiments. Experiments are conducted over most state-of-the-art LLMs. I particularly like the introduction of "newcomers" which consolidates the experiments.
- The evaluation metrics proposed are well-motivated and theoretically grounded.
Weaknesses
- While the GovSim environment is novel in its exploration of social science concepts, it is somewhat limited in its ability to accurately represent and visualize the complexity of actual agents' behavior. A platform where agents' actions can be grounded would further enhance the argument.
- The paper would be strengthened if the authors incorporated human participants in the study, allowing LLMs to coordinate with humans in real-time. This would provide valuable insights into how AI agents interact and cooperate with human decision-makers.
Questions
- Could the authors provide insights into how the platform could be further scaled up to handle more complex scenarios and larger agent populations?
Limitations
N/A, see weaknesses
Thank you for your positive feedback on our work, especially your comments acknowledging that our GovSim “provides a novel platform,” “bridges the development of LLM-agents with classic social science theories,” has “a unique perspective,” “offers thorough analyses and a variety of experiments”, and has “well-motivated and theoretically grounded” “evaluation metrics.”
Below we provide some additional information in response to your comments:
While the GovSim environment is novel in its exploration of social science concepts, it is somewhat limited in its ability to accurately represent and visualize the complexity of actual agents' behavior. A platform where agents' actions can be grounded would further enhance the argument.
In this first version, the main grounding comes from the agents' decisions to extract resources, which interact causally with the simulation environment. As such, we are able to provide graphs of resource levels over time and statistics on agent actions and their distributional properties. We also provide in our uploaded code package (to be open-sourced upon acceptance) an interactive dashboard to visualize the agent interactions with the environment (see Figure 7, Appendix B). Appendix B provides a detailed technical description of our setup, and Appendix B.3 describes the web interface we built to ground the agent actions.
The paper would be strengthened if the authors incorporated human participants in the study, allowing LLMs to coordinate with humans in real-time. This would provide valuable insights into how AI agents interact and cooperate with human decision-makers.
Thank you for the excellent suggestion for future work. Incorporating human participants would provide valuable insights into human-AI cooperation in these scenarios. Potential benefits include understanding how humans interpret and respond to LLM communication, identifying areas where LLMs excel or struggle compared to humans, and exploring the emergence of human-AI cooperative norms. We believe there is enough new work here that human-AI cooperation deserves a manuscript focused on those big questions. In the camera ready version of the manuscript we will add the following text: “Using the GovSim platform, a promising next step is to incorporate humans into the simulation. These human-AI interactions will challenge LLM-based agents to cooperate with humans using open-ended communication, and we can see whether the norms that develop are more or less effective than those created by LLMs alone.”
Could the authors provide insights into how the platform could be further scaled up to handle more complex scenarios and larger agent populations?
Great question! We will add the following to the future work section (Section 5) to engage directly with the question of scaling. To scale up GovSim, we would like to explore:
- A larger agent population: Our current simulation can easily generalize to more agents and a diversity of agent types. Adding agents will increase the simulation runtime, as each agent needs to condition its own behavior and dialogue on the other agents' actions and dialogues. Perhaps fine-tuned smaller LLMs can act as efficient simulators in this context without a loss in performance.
- Coordinated adaptation: People can flexibly adapt to sudden changes in game dynamics. For example, a sudden shrinkage of the resource (a temporary shock) or a change in the reproduction rate requires agents to rapidly adjust their cooperative norms in a coordinated way. GovSim enables these kinds of experiments because the simulation environment is modular, so resource dynamics, agents, and other elements are easily changeable for different runs of the simulation.
- Challenging tradeoffs: We are also interested in understanding exceptions to norms. For instance, one agent may face a one-off choice between serious personal harm and group sustainability, e.g., one agent will experience negative welfare unless they take more resources than allowed by the emergent cooperative norm. Do the other agents adapt and allow for such one-offs?
Thank you again for the constructive comments!
Thank you for the detailed response. I will keep my current score.
This paper proposes GOVSIM, a simulation platform for studying cooperative decision-making in Large Language Model (LLM) agents. The authors test various LLMs in three resource-sharing scenarios, finding that only a few instances (2 out of 45) achieve sustainable outcomes. They demonstrate that communication between agents is crucial for cooperation, with negotiation being the primary form of interaction. The study also conducts a robustness test of the effect of introducing a greedy newcomer into an established norm. The authors provide insights into improving agent performance through universalization reasoning and offer a comprehensive analysis of the factors contributing to sustainable cooperation in AI systems.
Strengths
- This paper is clearly written and easy to follow.
- Discussions about the cooperative abilities of agents and potential safety issues are both interesting and important. Additionally, the scenarios used in this article to discuss these issues are very engaging.
- Experiments on different models are adequate.
Weaknesses
- I regard the primary flaw of this article to be that it does not test the performance of the GPT-4 model. If my understanding is correct, Table 3 in Appendix D indicates that the paper uses GPT-4 Turbo (as the GPT-4 in the paper). GPT-4-Turbo and GPT-4 are entirely different models, and many existing studies suggest that GPT-4 performs better on certain tasks. Additional testing using the GPT-4 model could confirm whether LLMs have the ability to cooperate. Note that recent work has not only found that some LLMs possess cooperative abilities but also that they exhibit spontaneous cooperation.
- The discussion on robustness testing is inadequate, failing to explain why the system can be stable in certain circumstances. In the experiment, instead of introducing a large disturbance by the "LLM newcomer" in just one round, it would be more effective to introduce disturbances in different rounds to see if other LLM agents can adapt to these disturbances and maintain environmental sustainability by reducing their share in the rounds where disturbances take place.
- The paper claims that the three proposed scenarios have similar mathematical structures, but the performance in the third scenario (River Pollution) is significantly worse than in the other two. Although the article attributes this to the need to consider more factors, there is no data analysis or log analysis to support this claim. This weakens the generality of the research.
- The paper does not demonstrate the generalizability of its conclusions through prompt sensitivity analysis (e.g., paraphrasing). We cannot be sure whether specific prompts influenced the simulation results or if there is any data leakage involved.
- The research lacks clear contributions. While I believe that discussing the cooperative modes of agents is very meaningful, the article does not profoundly demonstrate how to enhance cooperation or explain why cooperation cannot be achieved. Many existing studies have already discussed how agents' communication abilities can enhance cooperation, so this is not a distinct contribution point.
Questions
In addition to the issues raised above, the author may further explore the following questions:
- Lack of explanation and discussion on the performance differences between models. Although many experiments demonstrate the differences between models, there are no tests attempting to explain how they differ. The author could address this issue by showing the differences in dialogue and actual behavior between models, even if only qualitatively.
- Similar to question 1, the interpretation of experimental results could be richer. For example, the author's interpretation of Figures 5c and 5d only explains why some models that can understand the beliefs and numerical implications of other agents perform well, but does not explain why llama3-70b performs poorly, even though it has relatively high accuracy in Figure 5c. A more profound explanation could help us understand the sources of performance differences between models and the underlying reasons for agents to cooperate.
- Ablation experiments could be added. For instance, instead of having the agent remember only the key information for planning, providing all information for the round could be tested to see if the agent's performance improves. There are many summarizations and simplifications in the study, which might potentially impact the decision-making performance of the models.
Minor issues:
- P.2 Line 72: pertubation -> perturbation?
- P.4 Line 138: The sudden appearance of the word "policy" is very abrupt in reading.
- P.34 There appears to be a duplication of GPT-4 in Table 20.
Limitations
The authors have adequately addressed the limitations.
Thank you for your positive feedback on our work.
Addressing Weaknesses
I regard the primary flaw of this article to be that it does not test the performance of GPT-4
Although some studies suggested enhanced performance of GPT-4, whose latest version was released on 13 Jun 2023, this is no longer the case with the recent GPT-4-Turbo model released on 09 Apr 2024 and GPT-4o released on 13 May 2024. According to OpenAI's evals, these two models exceed the capabilities of the original GPT-4 (see the openai/simple-evals benchmark results). Moreover, both GPT-4-Turbo and GPT-4o have a higher rank on LMSYS's Arena. We took two actions: (1) we updated the model names in the paper to GPT-4-Turbo to avoid any confusion, and (2) we added new results for the original GPT-4 in our rebuttal PDF, which confirm that it is less cooperative than the other two.
recent work has not only found that some LLMs possess cooperative abilities but also exhibit spontaneous cooperation
We agree that this is a potentially confusing result. Recent work on cooperation with LLMs (see related work section) has mostly been limited to simple social dilemmas (e.g., the prisoner's dilemma) and other matrix-form games played by two players without communication. In contrast, we study a common-pool resource problem played with 5 agents over 12 iterations interleaved with open-ended dialogue. Effectively reasoning about how to sustain the shared resource into the future remains a significant challenge, as we demonstrate empirically. This is not inconsistent with prior work showing that LLM-based agents have a proclivity to cooperate. The complexity of GovSim and the shift from prompting LLMs to generative agents are both novel contributions to our understanding of cooperation in LLM agents. We have updated the related work section to draw this contrast more clearly.
We hope clarification of the above two points can resolve your main concern of our work.
it would be more effective to introduce disturbances in different rounds
In the newcomer scenario, the round number is not a key variable: a shared trend is that once the agents reach a stable equilibrium, the stability persists for the rest of the rounds. To explain further how models reach stability, we inspected their reasoning steps. We find that, during the discussions, once the agents agree on an upper limit, it is kept and the agents conform to it over multiple rounds. This also explains why agents (in Figure 3b) do not recover to the max-efficiency resource consumption, but keep a low consumption after the disturbance.
Many existing studies [...], which is not a distinct contribution point.
While open-ended communication in the context of cooperation has been studied in behavioral economics and social psychology, we believe we are the first (or among the first) to study the role of open-ended communication in aiding LLMs to solve common-pool resource problems. We would be happy to add to the related work section if we have missed important citations.
Furthermore, characterizing the role of communication was just one of our contributions. In addition to our communication ablation study, we also emphasize the following key interdisciplinary novelties:
- Long-term cooperation for sustainability: We are the first to combine the idea in Governing the Commons with LLMs, i.e., whether LLMs can reliably sustain cooperation over many time periods with complex dynamics.
- Two-player vs. Multi-player interactions: Most existing studies focus on two-player cooperation, e.g., the prisoner's dilemma (e.g., in GTBench), whereas our work involves up to five agents and the resulting complex dynamics, leading to more complex group behaviors. It only takes one non-cooperative player to harm the entire group, so with five players, sustainable cooperation is more fragile and requires more robust agents.
- Universalization: We are the first to study how Universalization (a cognitive model of moral thinking) impacts LLM behavior. Inspired by work on human subjects (Levine et al. 2022), giving LLMs the ability to Universalize increases LLM survival time by 4 months.
Addressing Questions
The author could address this issue by showing the differences in dialogue and actual behavior between models, even if only qualitatively.
We appreciate your feedback. We have conducted several quantitative analyses to compare the differences between models on sub-skills and the breakdown of their dialogues:
- Section 3.6 Analysis of Agent Dialogues: We quantitatively analyze the conversations produced by the LLM during the discussion phase. Figure 4b shows that GPT-4o has the largest portion of negotiation-related discussions, which may explain why it is the best at sustainability in most evaluations. As the reviewer suggests, we provide qualitative examples of such dialogues in Appendix G and will add more examples for a selection of the LLMs in the camera ready.
- Section 3.7 Subskill Analysis & Appendix F.2: We investigate several sub-skills to explain the overall sustainability results. For example, Figures 5c and 5d confirm that only GPT-4o, GPT-4-turbo, and Claude-3 Opus can formulate beliefs about other agents independently and calculate their numerical implications. This explains their higher sustainability rates (Pearson correlation of 0.83 for test case d). See Lines 266-284 for more analyses, and more supplementary figures for each scenario in F.2.
the interpretation of experimental results could be richer. [...] does not explain why llama3-70b performs poorly
For a model to perform well in GovSim, it requires all of the underlying sub-skills. As shown in Figures 5a and 5b, Llama3-70B performs poorly on a subset of these sub-skills, e.g., failing to handle prompts like “If each fisherman catches M tons, how many tons of fish will there be next month?”. In the camera ready, we will update Figure 5 with our rebuttal PDF's Figure 1 to better highlight this reasoning.
prompt sensitivity analysis
To address this worry, we developed two paraphrased versions of the fishing scenario prompt, which were 61% and 81% different from the original, respectively. We tested these paraphrased prompts using GPT-4-Turbo on 5 seeds, achieving average survival times of 9 months and 12 months. Aggregating these results with our original setup yields an average survival time of 11 months across the variants. This analysis demonstrates that while there is some heterogeneity in performance across prompt variations, our core findings remain robust.
Our intention in developing the three scenarios (lake, pasture, pollution) was to better characterize the variability and robustness of LLM behavior. Averaging across 5 seeds for each run, and while we observe some heterogeneity, our results are robust.
any data leakage involved.
We don’t believe data leakage is possible as all scenarios are developed specifically for this project. All prompts are available in Appendix C for inspection and are not otherwise available on the internet in any form.
Thank you for the insightful feedback. We hope that if we have sufficiently addressed your concerns you will consider raising your score. If any outstanding questions remain, please let us know and we will promptly respond.
Dear Reviewer,
Thank you again for kindly reviewing our paper. We have taken your review very seriously, conducted additional experiments, and clarified points of confusion. Since the discussion period ends in 2 days, could you let us know whether our answers and additional experiments help address your original concerns?
We really cherish your feedback, and would be more than happy to provide additional details per your request!
Sensitivity Analysis
I appreciate the additional sensitivity analysis experiments added by the authors; I believe they are reasonable.
Data Leakage
My concern about data leakage is not whether these scenarios have been learned by the LLM, but rather that this alone does not entirely rule out the possibility of data leakage. While the specific prompts may be unique, the basic concepts and structures of resource management scenarios are widely discussed in academic literature and public discourse. Similar scenarios or principles might have become part of the training data for LLMs, potentially influencing their performance in subtle ways. I am not asserting that data leakage has occurred, but it should be the authors’ responsibility to prove its absence through experiments, even if just simple control experiments. Without a detailed analysis of the relationship between model outputs and their training data, it is difficult to definitively rule out data leakage or overfitting to common patterns in resource management problems. This is why I have consistently emphasized the need for more detailed analysis of experimental results rather than merely presenting surface-level data. Otherwise, many details in the paper, such as the differences in sub-skills between llama and Claude, and the discrepancies between different models on the same task, remain unexplained.
Overall, I appreciate the effort the authors have put into addressing these concerns, but I believe there are still a series of fundamental disagreements on many key issues, and I view these disagreements as systemic. I welcome further rebuttals from the authors on these points.
Thank you for your points. We want to clarify a few misunderstandings and restate our view of the contributions of our work.
On the Impact of Different Models on Results
original GPT-4 model tests
We thank the reviewer for appreciating the introduction of the original GPT-4 (version 13 Jun 2023) in our experiments as requested. In the original review this was described as “the primary flaw of this article,” so we focused our time and financial resources on promptly generating these new results. We have now shown that it does not impact the contributions we have claimed.
I must counter the results provided in Table 1 of the PDF document and those presented in the appendix.
Our goal in introducing three different scenarios (fishing pond, sheep grazing, river pollution) in the original manuscript, together with the new results described in the rebuttal that include two additional prompt variants, is to find systematic patterns in performance. To see this more clearly, below is a table showing the aggregate performance of each model across the key metrics (standard errors will also be included in the published version but are currently excluded for readability). We believe these results show the strong systematicity our conclusions depend on. This table will replace Table 1 in the PDF, as it better clarifies how the agents behave in this task beyond survival rate (i.e., over-usage and equality).
This table also includes a new metric, Survival Rate (q), which we define as the proportion of runs that achieve the maximum survival time, i.e., q = #{runs with m = 12} / #runs. In the camera ready version, we update all results to include this metric, which we believe helps clarify models' performance when average survival times are close.
| Model | Surv. Rate | Surv. Time | Gain | Eff. | Eq. | Over-usage |
|---|---|---|---|---|---|---|
| Llama-3-8B | 0.0 | 1.0 | 20.0 | 16.7 | 57.3 | 20.0 |
| Llama-3-70B | 0.0 | 1.0 | 20.0 | 16.7 | 90.7 | 38.7 |
| Mistral-7B | 0.0 | 1.0 | 20.0 | 16.7 | 82.6 | 37.3 |
| Mixtral-8x7B | 0.0 | 1.1 | 20.1 | 16.7 | 75.0 | 33.3 |
| Qwen-72B | 0.0 | 1.8 | 24.0 | 20.0 | 83.9 | 32.4 |
| Qwen-110B | 20.0 | 4.5 | 36.3 | 30.3 | 89.6 | 47.0 |
| Claude-3 Haiku | 0.0 | 1.0 | 20.0 | 16.7 | 91.0 | 35.7 |
| Claude-3 Sonnet | 0.0 | 1.3 | 20.5 | 17.1 | 84.4 | 32.0 |
| Claude-3 Opus | 46.7 | 6.9 | 58.5 | 48.8 | 91.4 | 21.0 |
| GPT-3.5 | 0.0 | 1.1 | 20.3 | 16.9 | 91.2 | 35.3 |
| GPT-4 | 6.7 | 3.9 | 31.5 | 26.2 | 91.4 | 27.1 |
| GPT-4-turbo | 40.0 | 6.6 | 62.4 | 52.0 | 93.6 | 15.7 |
| GPT-4o | 53.3 | 9.3 | 66.0 | 55.0 | 94.4 | 10.8 |
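As a concrete illustration, the Survival Rate column above would be computed from per-run survival times as in the minimal sketch below (the data are hypothetical and only illustrate the definition of q):

```python
# Hypothetical per-run survival times (in months) for one model across seeds and scenarios.
survival_times = [12, 3, 12, 7, 12]
MAX_MONTHS = 12  # maximum survival time m

# Survival Rate q = fraction of runs that reach the maximum survival time.
q = sum(m == MAX_MONTHS for m in survival_times) / len(survival_times)
print(f"Survival rate q = {q:.1%}")  # -> 60.0% for this hypothetical data
```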
For instance, in the example of sheep grazing pasture, Claude-3 and GPT-4 exhibit significant differences.
Our goal was not to highlight one particular model as the “best” or “winner,” but rather to upper-bound performance on our novel benchmark GovSim, study general methods that could in principle apply to all LLM-based agents to improve performance (e.g., Universalization), and try to understand how LLM performance in GovSim can be explained by underlying reasoning abilities and the ability to communicate. LLM agent evaluation is a moving target, as new models and capabilities are announced every year, and thus we designed our platform to have greater longevity than any one particular model.
Additionally, within the Fishing Pond scenario, the performance of the Claude-3 Sonnet, which generally performs better, falls short compared to Claude-3 Haiku. However, the authors did not provide any explanation for these notable differences between models.
In the default GovSim environment, neither model achieves sustainable cooperation (the survival rate is 0 for both models across all environments). We assume the reviewer is looking at the results from our Universalization experiments. The Universalization prompt improved performance across all model classes on each measured metric. Survival rates went up as a result of lower over-usage and greater equality (where possible). Claude-3 Haiku had a greater reduction in over-usage than Claude-3 Sonnet, and thus a higher survival rate when the agents were able to universalize their actions. We prefer not to speculate on the cause of this particular difference, as these closed models are relatively opaque, but instead refocus the text to make clear that our main take-away from this set of experiments is that Universalization greatly improves the sustainability of cooperation for all LLM-based agents studied. We will add a sentence to our results saying something like, “although some models benefited more than others.” We will also add the aggregate table below to the appendix and have clarified the language in the paper to make these points clearer.
| Model | Surv. Rate | Surv. Time | Gain | Eff. | Eq. | Over-usage |
|---|---|---|---|---|---|---|
| Llama-3-8B | 6.7 | 5.4 | 48.8 | 40.7 | 72.7 | 8.4 |
| Llama-3-70B | 66.7 | 8.3 | 63.4 | 52.8 | 85.8 | 10.0 |
| Mistral-7B | 0.0 | 2.1 | 27.6 | 23.0 | 84.5 | 36.8 |
| Mixtral-8x7B | 33.3 | 4.9 | 38.0 | 31.7 | 74.9 | 14.6 |
| Qwen-72B | 20.0 | 5.5 | 51.9 | 43.3 | 80.2 | 14.3 |
| Qwen-110B | 100.0 | 12.0 | 92.5 | 77.1 | 96.6 | 0.9 |
| Claude-3 Haiku | 53.3 | 7.2 | 69.0 | 57.5 | 89.0 | 6.2 |
| Claude-3 Sonnet | 26.7 | 5.3 | 51.9 | 43.3 | 91.9 | 13.7 |
| GPT-3.5 | 26.7 | 7.3 | 57.1 | 47.6 | 81.9 | 6.9 |
| GPT-4-turbo | 80.0 | 11.5 | 101.8 | 84.9 | 98.3 | 5.4 |
| GPT-4o | 100.0 | 12.0 | 112.2 | 93.5 | 99.7 | 0.4 |
Merely presenting numerical results is insufficient, as potential cherry-picking in prompt settings could raise concerns among readers. This issue aligns with the fifth point I raised under weaknesses, where I highlighted the lack of explanation regarding why cooperation occurs or fails to occur. I believe this analysis is crucial.
We agree with the reviewer that survival rates (or any single number) are insufficient to fully characterize behavior in GovSim. Instead they are a starting point. With the new aggregate tables shown above we can better highlight other metrics besides survival rate such as equality and over-use which have an impact on survival. We can commit to this change for the camera ready. We agree that explanation is crucial and we summarize the explanatory tools that are already included or have been added during the revision period:
- Six quantitative metrics for each LLM based agent both in aggregate and broken down by scenario (Tables above and Table 1)
- Subskill analysis for each LLM based agent (Figure 5) and broken down by scenario (Appendix Figures 11-13)
- Time series graphs for the top six performing models in each scenario (Appendix Figures 8-10)
- Example dialogues included in the Appendix (Appendix G, with a commitment to add more)
- An open source web dashboard (included with the submission) that allows for in-depth analysis of dialogue for all seeds included in the manuscript. (Screenshot shown in Appendix Figure 7).
- Prompt rephrasing that shows robustness against cherry picking (New in rebuttal)
Together, these six methods point to the following two-part explanation. Agents fail to cooperate when they over-use the resource (harvest too many resources). For many of the agents this happens in the first round, and as such there is no cooperation. To avoid this outcome, we first show that sufficient mathematical understanding of the simulation dynamics (as indicated by Figure 5a) is necessary; smaller and weaker models such as Llama-3-8B and Mistral-7B, among others, struggle to grasp the simulation dynamics. Second, the ability to form beliefs about the actions taken by other agents is also strongly associated with survival (Figure 5d). The Universalization experiment can be interpreted as an intervention on these abilities: Universalization gives the agents both beliefs about others and some mathematical indication of what would be sustainable if all agents acted symmetrically. Consistent with our explanation, Universalization significantly improves survival, especially in the weaker models (Section 3.4): survival time increases by 4 months for models that are not already near ceiling (t-test; p < 0.001). We thank the reviewer for the opportunity to lay out these resources and will update the paper accordingly to make these explanations clearer.
Comparison with Existing Work
the complexity of the scenarios in this study is limited, which diminishes the contribution of the work
We value the reviewer's concern. However, following the Goldilocks principle, we targeted the current complexity of GovSim to be challenging for the most advanced models but not so complex that all models fail. At the current level of complexity, the highest survival rate is below 54%. At this level, GovSim remains a challenging problem for the research community while still allowing us to study cooperative dynamics and emergent norms. With greater complexity, all models would fail and our ablations and interpretability studies would not yield insights. While we are excited about increased complexity, we have outlined what we believe to be the most promising directions and made our platform open source and easily extensible to allow for scenarios of higher complexity in future work.
Additionally, the definition of "prompting LLMs to generate agents" is confusing to me—what distinguishes this from other approaches?
In our work we introduce the Act-Discuss-Reflect (ADR) Agent Architecture, a phase-based generative agent framework. The ADR agent architecture is a class of models developed in line with recent work, “Generative Agents: Interactive Simulacra of Human Behavior” (Park et al., 2023). The Generative Agents framework summarizes and then retrieves relevant information at inference time to condition the LLM's output. The key component of this architecture is a “memory,” a vector-based database, which contains agents' past experiences, actions, and observations of the environment. Different modules (e.g., reflection) use and create new memories based on recent experiences. These agent-based architectures are necessary to analyze social dynamics that include open-ended conversation, as 12 rounds of dialogue fills up too much of the context window of even the most powerful LLMs. In contrast, prior work has largely focused on simple decision-making or game-theoretic tasks and asked an LLM to select an option.
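To make the distinction concrete, below is a minimal sketch of the retrieval-based memory pattern described above. The class, function names, and similarity scoring are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class AgentMemory:
    """Sketch of a vector-store memory: store experiences, then retrieve only the
    most relevant ones to condition the LLM, instead of the full dialogue history.
    The embedding function is a stand-in for whatever embedder is actually used."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # maps text -> np.ndarray
        self.records = []          # list of (text, embedding) pairs

    def add(self, text):
        self.records.append((text, self.embed_fn(text)))

    def retrieve(self, query, k=5):
        q = self.embed_fn(query)
        sims = [float(np.dot(emb, q) / (np.linalg.norm(emb) * np.linalg.norm(q) + 1e-9))
                for _, emb in self.records]
        top = np.argsort(sims)[::-1][:k]
        return [self.records[i][0] for i in top]

def build_prompt(memory, observation):
    """Condition the agent on retrieved memories plus the current observation."""
    relevant = memory.retrieve(observation, k=5)
    return ("Relevant memories:\n" + "\n".join(relevant)
            + f"\n\nCurrent observation: {observation}\nDecide how much to harvest.")
```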
I am willing to provide a series of existing game-theory-based studies involving communication, as well as work that goes beyond classical game theory.
We would value any additional references that you are willing to share with us and can commit to discussing them in the related work. However, just to clarify, we do not claim to be the first to study game theory in the context of communication (e.g., prior work on human behavior in behavioral economics and on multi-agent reinforcement learning, where communication is templated or done in an emergent language), nor are we the first to “go beyond classical game theory.” We believe the novelty of our work comes from leveraging LLM-based agents that are capable of communicating in an open-ended fashion and from our focus on common-pool resource problems.
On Robustness Testing
Introducing disturbances across different rounds would provide a more comprehensive test of the agents' adaptability and the framework's robustness. In real-world scenarios, a fixed disturbance at a single time point is not representative.
Our goal was to study a minimal perturbation to the system as a study of robustness. This minimal perturbation is analogous to those studied in evolutionary game theory when empirically computing whether a particular strategy is an evolutionary stable strategy (ESS, see below for references). We will add the following references to our discussion to better motivate our use of a single disturbance.
M. A. Nowak, A. Sasaki, C. Taylor, and D. Fudenberg. Emergence of cooperation and evolutionary stability in finite populations. Nature, 428(6983):646–650, 2004
D. Fudenberg and L. A. Imhof. Imitation processes with small mutations. Journal of Economic Theory, 131(1):251–262, 2006.
Overall, we agree with the reviewer that multiple disturbances will reduce sustainable cooperation to a greater extent than a single disturbance. We will add a sentence to the result section with approximately the following meaning: “While the results here only show a single newcomer, we expect that larger perturbations (e.g., multiple newcomers or more adversarial newcomers) will have a greater impact on disrupting sustainable cooperation.” Below we show that the overall survival rate is 33% with a single disturbance (down from 54% in the baseline condition). If agents were already robust to a single disturbance we would have introduced larger perturbations. However, even a single disturbance remains a challenge for even the best model so we believe studying more sophisticated perturbations is best left to future work.
Can the authors report how often good cooperation was achieved across multiple runs?
For the prior experiments, these numbers are reported as the survival rate in the tables above. If you mean the newcomer experiments, the results for GPT-4o are: with the newcomer, the survival rate was 33.3% (vs. 53.3% in the baseline condition; GPT-4o, 3 scenarios with 5 seeds each).
| Condition | Surv. Rate | Surv. Time | Gain | Eff. | Eq. | Over-usage |
|---|---|---|---|---|---|---|
| GPT4o | 53.3 | 9.3 | 66.0 | 55.0 | 94.4 | 10.8 |
| GPT4o + Newcomer | 33.3 | 6.6 | 34.8 | 31.3 | 71.7 | 15.7 |
I understand that space constraints prevent listing all prompts, but since the code is not open-sourced, I cannot determine whether cooperation arises from specific prompts, random chance, or is genuinely a result of LLM's in-context learning.
To clarify, the code and dashboards from the manuscript are available to the reviewer. They can be accessed near the top of the page by downloading the supplemental information. As mentioned in footnote 1 in our submission, “Our code and data have been uploaded to the submission system and will be open-sourced upon acceptance.”
By averaging across scenarios and rephrased prompts we show that cooperation does not depend on specific prompts. By showing that the results are systematic and are a function of the underlying agent subskills we show that they are not random.
On Qualitative Analysis
Since the authors suggest that the greater the proportion of negotiation-related discussions, the more likely cooperation is to occur, what are the underlying reasons for this? Why does cooperation remain weak even when the proportion is large? I think this work still lacks in-depth analysis to make the conclusions reliable.
We do not make the claim that the greater the proportion of negotiation-related discussions, the more likely cooperation is to occur. Our intent in analyzing agent dialogue is to give readers a sense of what occurs during the simulation process rather than relying on just aggregate metrics.
Of course, I appreciate the authors' commitment to adding more examples, which will help improve the reliability of the work.
Thank you. All simulation runs will also be available in the open-source GitHub repo.
Regarding the issue in Section 3.7, please revisit my second question on “why llama3-70b performs poorly, even though it has relatively high accuracy in Figure 5c.” I find the authors’ claim that “requires all of the underlying sub-skills” lacks evidence. The definition of aggregated sub-skills is too vague to be convincing, and more detailed explanations are needed here.
The aggregated sub-skill metric is the average score of the 4 subskill tests: (a) simulation dynamics, (b) sustainable action, (c) sustainability threshold (assumption), (d) sustainability threshold (belief). We have added this definition to the camera ready version of the paper.
While Llama3-70B scores highly on some of the subskills, it performs poorly on others. Our empirical finding is that each subskill is highly correlated with survival time; however, the aggregated sub-skill is a better predictor than any individual skill. When these subskills are used as independent variables in an OLS regression analysis, we find a highly significant R² = 0.95 (p < 0.001). We hope this clarifies both how the aggregated sub-skill was calculated and adds statistical rigor to the subskill analyses. We have added both to the manuscript.
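For transparency, a sketch of this kind of analysis is below. The subskill scores and survival times are made-up numbers for illustration (the paper reports R² = 0.95 on its actual data), and statsmodels is just one possible tool for the OLS fit.

```python
# Sketch of the subskill -> survival-time regression described above.
# All numbers are hypothetical; columns follow the four subskill tests:
# (a) simulation dynamics, (b) sustainable action,
# (c) sustainability threshold (assumption), (d) sustainability threshold (belief).
import numpy as np
import statsmodels.api as sm

subskills = np.array([
    [0.90, 0.80, 0.70, 0.90],
    [0.40, 0.30, 0.20, 0.50],
    [0.70, 0.60, 0.80, 0.60],
    [0.20, 0.10, 0.30, 0.20],
    [0.95, 0.90, 0.85, 0.90],
    [0.50, 0.40, 0.50, 0.40],
])
survival_time = np.array([11.0, 2.0, 7.0, 1.0, 12.0, 4.0])

aggregated = subskills.mean(axis=1)  # aggregated sub-skill = average of the 4 tests
ols = sm.OLS(survival_time, sm.add_constant(subskills)).fit()
print("Aggregated sub-skill scores:", aggregated)
print("R^2 of the 4-subskill regression:", ols.rsquared)
```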
Sensitivity Analysis
I appreciate the additional sensitivity analysis experiments added by the authors; I believe they are reasonable.
We express our gratitude to the reviewer for recognizing the value of the additional sensitivity analysis experiments.
Data Leakage
Without a detailed analysis of the relationship between model outputs and their training data, it is difficult to definitively rule out data leakage or overfitting to common patterns in resource management problems.
While we agree that it would be interesting to analyze the relationship between LLM training data and model outputs, this is far beyond the scope of our study. First, none of the LLMs studied (including the open-weights models) are open with respect to their training data, so we would be guessing at what these models are trained on. Second, to the extent we can assume that these models were trained on Common Crawl and other web data, data leakage cannot explain the variance we see across models. Third, understanding the link between training data and model capabilities deserves a manuscript of its own and goes beyond the scientific contributions we aim to make in this work. We will add a sentence to the limitations section to make the reviewer's concern salient to the reader.
Thank you for the engagement with our paper.
On the Impact of Different Models on Results
While I appreciate the additional tests conducted using the original GPT-4 model, I must counter the results provided in Table 1 of the PDF document and those presented in the appendix. The results indicate significant differences between various approaches and substantial performance disparities across different models. These variations raise concerns about the consistency of model performance and the generalizability of research findings. For instance, in the example of sheep grazing pasture, Claude-3 and GPT-4 exhibit significant differences. Additionally, within the Fishing Pond scenario, the performance of the Claude-3 Sonnet, which generally performs better, falls short compared to Claude-3 Haiku. However, the authors did not provide any explanation for these notable differences between models. Merely presenting numerical results is insufficient, as potential cherry-picking in prompt settings could raise concerns among readers. This issue aligns with the fifth point I raised under weaknesses, where I highlighted the lack of explanation regarding why cooperation occurs or fails to occur. I believe this analysis is crucial.
Comparison with Existing Work
I agree with the authors that maintaining shared resources in the future is both a challenge and a critical issue. However, as previously noted (and also mentioned by Reviewer 9ZQN), the complexity of the scenarios in this study is limited, which diminishes the contribution of the work. Furthermore, while the authors argue that most existing studies are confined to simple social dilemmas involving only two agents, I am willing to provide a series of existing game-theory-based studies involving communication, as well as work that goes beyond classical game theory. I do not see the novelty in this work regarding (1) multi-agent communication and (2) complex social situations. Additionally, the definition of "prompting LLMs to generate agents" is confusing to me—what distinguishes this from other approaches? I suggest that the authors conduct a more thorough investigation of existing work to better understand the true contribution of this study. This also ties into the fifth point regarding the contribution: while I agree that the public resource issue is significant, I am highly concerned about the ability of this work to generalize to more complex real-world situations. The "complex dynamics" discussed in this work may not be as robust as the authors believe.
On Robustness Testing
The new entrant scenario offers some insights into how models handle disturbances, but it only addresses a single type of disturbance at a fixed point in time. As initially suggested, introducing disturbances across different rounds would provide a more comprehensive test of the agents' adaptability and the framework's robustness. In real-world scenarios, a fixed disturbance at a single time point is not representative. Given the new results, I am more curious whether cooperation occurs consistently. Can the authors report how often good cooperation was achieved across multiple runs? I understand that space constraints prevent listing all prompts, but since the code is not open-sourced, I cannot determine whether cooperation arises from specific prompts, random chance, or is genuinely a result of LLM's in-context learning.
On Qualitative Analysis
I believe the discussion of the proportion analysis in Section 3.6 is sound, but it does not provide enough insight into how agents generate cooperation. Since the authors suggest that the greater the proportion of negotiation-related discussions, the more likely cooperation is to occur, what are the underlying reasons for this? Why does cooperation remain weak even when the proportion is large? I think this work still lacks in-depth analysis to make the conclusions reliable. Of course, I appreciate the authors' commitment to adding more examples, which will help improve the reliability of the work.
Regarding the issue in Section 3.7, please revisit my second question on “why llama3-70b performs poorly, even though it has relatively high accuracy in Figure 5c.” I find the authors’ claim that “requires all of the underlying sub-skills” lacks evidence. The definition of aggregated sub-skills is too vague to be convincing, and more detailed explanations are needed here.
This paper presents GOVernance of the Commons SIMulation (GOVSIM), a generative simulation platform to study strategic interactions and cooperative decision-making among large language model (LLM) agents. The authors investigate sustainable resource sharing in a society of AI agents using different LLMs to determine their ability to achieve cooperative outcomes. The study finds that most LLMs fail to maintain sustainable cooperation, largely due to communication deficits and the inability to consider long-term consequences. The paper also introduces "Universalization"-based reasoning, which significantly improves the sustainability of the agents' actions.
Strengths
Novelty: The introduction of GOVSIM as a platform for studying cooperative behavior in LLMs is innovative and addresses a critical gap in the literature on AI safety and multi-agent interactions.
Comprehensive Analysis: The paper thoroughly evaluates different LLMs across multiple resource-sharing scenarios, providing a broad view of their cooperative capabilities.
Open Source Contribution: The authors promise to provide the full suite of their research results, including the simulation environment, agent prompts, and a web interface, which can foster further research and development in this field.
Ethical Considerations: The study integrates ethical reasoning (Universalization) into the agents' decision-making process, demonstrating a forward-thinking approach to embedding moral principles in AI behavior.
Weaknesses
Limited Scenario Complexity: The resource-sharing scenarios in GOVSIM are relatively simplistic and may not capture the full range of complexities found in real-world resource management.
Generalizability: The findings might not generalize well to more complex or heterogeneous environments, especially those involving mixed human-AI interactions.
Communication Limitations: The study highlights the importance of communication but does not provide a detailed analysis of how different communication strategies or protocols might improve cooperative outcomes.
Over-reliance on LLMs: The study assumes that current LLMs can approximate human-like strategic reasoning and negotiation skills, which might be an overestimation of their current capabilities.
Questions
- How does the performance of LLM agents in GOVSIM compare to human performance in similar resource-sharing scenarios?
- What specific communication strategies or enhancements could be implemented to improve cooperative outcomes among LLM agents?
- How would the introduction of more complex, real-world variables (e.g., variable resource regeneration rates, multiple resource types) impact the agents' ability to cooperate sustainably?
- What measures can be taken to improve the generalizability of the findings to more diverse and heterogeneous environments?
- How do different LLMs handle the introduction of multiple adversarial agents or more sophisticated strategic manipulations?
Limitations
The paper introduces a novel and valuable simulation platform for studying cooperative behavior among LLMs, contributing to the field of AI safety and multi-agent systems. However, the scenarios presented are a bit simplistic to provide meaningful insights into real-world applications.
Thank you for the thoughtful review and your recognition of its four strengths in terms of novelty, comprehensive analysis, open-source contribution, and ethical considerations.
We aim to address your concerns and demonstrate the robustness and impact of this research.
Addressing Weaknesses
Re “Limited Scenario Complexity”: While the scenarios in GovSim are simplified to some extent, the complex, open-ended nature of our simulation is a significant step towards realism compared to the highly simplified paradigms leveraged from behavioral game theory, which have largely been the focus of prior work. Furthermore, while more complex variants are possible, our goal is first to establish a framework that can serve as a foundation to be flexibly extended by ourselves and others in the community. The design choices made balance complexity and interpretability, as simpler scenarios allow us to study cooperative principles with greater systematicity. Moreover, our current scenarios and dynamics already present significant challenges for current LLMs. We have added a discussion of these considerations to the manuscript.
Re “Generalizability”: Because our framework is open-ended by design, it is not possible to study the full diversity and heterogeneity of settings that GovSim can support in a single paper. To address this gap we have added a discussion of future work (in Section 5) that incorporates more complex variables (such as variable regeneration rates and multiple resource types) and discusses more heterogeneous agent pairings (mixtures of weak and strong LLMs, and mixed human-AI interactions). Since GovSim is open-source, it will enable researchers to contribute additional scenarios and environments, enhancing generalizability.
Re “Communication Limitations”: The idea that AI can coordinate cooperation through open-ended communication is a key novelty of the GovSim environment. We show that ablating communication causes a significant reduction in sustainability and that the most effective models mostly use the communication period to negotiate and persuade (Figure 4). We have improved the presentation of these results with an update of Figure 4, shown in our rebuttal PDF as Figure 2. In future work, we hope to study a larger set of protocols (peer punishment, voting, binding agreements, and coalition formation), or even allow private communication between subsets of agents.
Re “Over-reliance on LLMs”: We have edited the text to remove these assumptions. We agree with the reviewer that while LLMs are a powerful new technology they do not possess human-like strategic reasoning and negotiation skills in many cases. See new data that includes a single-turn comparison with the human subjects (below).
Addressing Questions
How does the performance of LLM agents in GOVSIM compare to human performance?
We have conducted preliminary single-turn comparisons with human subjects in the fishing scenario, following a setup similar to Figure 5b. While there is significant variation in both human and LLM performance, the top-performing LLMs (e.g., GPT-4-Turbo, GPT-4o) exceed human performance in sustainable resource management (66% vs. 22%, respectively). Interestingly, both human participants and LLMs benefit greatly from the Universalization prompting scheme (99% vs. 66%).
What specific communication strategies or enhancements could be implemented to improve cooperative outcomes among LLM agents?
See response to “Communication Limitations” above.
How would the introduction of more complex, real-world variables (e.g., variable resource regeneration rates, multiple resource types) impact the agents' ability to cooperate sustainably?
We anticipate that these variables will make sustainable cooperation more challenging for agents, requiring better negotiation skills and more sophisticated long-term planning and reasoning. For instance, with multiple resource types, agents would need to balance their preferences and negotiate trade-offs. Variable regeneration rates would require adaptive strategies and potentially more frequent communication. These complexities would test the LLMs' ability to reason about interconnected systems and make increasingly complex decisions under uncertainty. As LLM capabilities improve, GovSim can be flexibly extended with these challenges. Based on our current results, LLM capabilities are not yet sufficient to handle such scenarios effectively. Our experiments already reveal limitations in long-term planning and multi-variable reasoning for many models.
What measures can be taken to improve the generalizability of the findings to more diverse and heterogeneous environments?
See response to “Generalizability” above.
How do different LLMs handle the introduction of multiple adversarial agents or more sophisticated strategic manipulations?
We found that even a single aggressive agent significantly disrupted cooperation, leading to lower equality scores (from 98.05 in the default setting to 85.78 with the newcomer). More sophisticated or multiple adversaries, especially those using deceptive strategies, are likely to lead to an even greater reduction in cooperation. How LLMs cope with these agents can be flexibly studied within the GovSim framework.
We hope that our responses and updates to the text address your concerns and demonstrate the potential for the broader impact of our research. Please follow up with any remaining questions or if our responses have been insufficient in any way. If you believe we have sufficiently addressed your requests we kindly request you to reconsider your score.
Dear Reviewer,
Thank you again for your efforts in reviewing our work. Should you have a moment, could you read over our response? We would be happy to address if you have further questions! Please don't hesitate to let us know if our answers look good to you.
Firstly, we would like to thank all reviewers for the valuable feedback. Three out of four reviewers recommended acceptance (with ratings of 7, 6, and 6) and we believe we have addressed the key concerns of Reviewer 2 directly. We are very encouraged by the large number and diversity of positive comments:
- Novelty: The introduction of GovSim as a platform for studying cooperative behavior in LLMs is recognized as innovative and addressing a critical gap in the literature on AI safety and multi-agent interactions (Reviewer 9ZQN: “GovSim as a platform” “is innovative and addresses a critical gap”).
- Comprehensive Experiments: The experiments on different models are acknowledged as adequate and insightful (Reviewer KPE1: “Experiments on different models are adequate”). The extensive evaluations, including benchmarking, norm robustness testing, and sustainability improvements, are also praised (Reviewer dG5Q: “evaluations are comprehensive and insightful”).
- Insightful Analysis: Our work thoroughly evaluates different LLMs across multiple resource-sharing scenarios, providing a broad view of their cooperative capabilities (Reviewer 9ZQN: “The paper thoroughly evaluates different LLMs across multiple resource-sharing scenarios” and provides “a broad view”). Additionally, the scenarios used in this article are noted for being very engaging (Reviewer KPE1: “the scenarios” “are very engaging”).
- Method innovation of model’s ethical behavior: The integration of ethical reasoning (Universalization) into the agents' decision-making process demonstrates a forward-thinking approach to embedding moral principles in AI behavior (Reviewer 9ZQN: “integrates ethical reasoning”, “demonstrating a forward-thinking approach”).
- Open Source Contribution: Our commitment to providing the full suite of our research results, including the simulation environment, agent prompts, and a web interface, is highlighted as a strength that can foster further research and development in this field (Reviewer 9ZQN: “can foster further research and development in this field”).
- Well-Written and Accessible: Our paper is noted for being clearly written and easy to follow, making it accessible to a broad audience (Reviewer KPE1: “clearly written and easy to follow”).
The main requests and critical feedback raised by reviewers include running our results on more models (the original GPT-4 model), conducting prompt sensitivity analysis, addressing potential future extensions, and explaining the varied settings we tested on. Our rebuttal addresses these concerns comprehensively: (1) As requested by Reviewer 2, we tested the performance with the original GPT-4 (0613) and report that it does not surpass GPT-4-Turbo and GPT-4o (see rebuttal PDF) or change the results of our study. We have provided new graphical analyses demonstrating the stability of performance and consistency across the different scenarios, and will conduct a paraphrasing test to ensure prompt robustness. Following the suggestion of Reviewer 1, we have compared human performance with LLM performance on the GovSim task. (2) As discussed in the individual replies, we will include in the camera ready (with the additional page allowed) a richer discussion of the scalability and adaptability of GovSim: including more complex, real-world variables and human-AI interaction tests. (3) We also used reviewer feedback to better reference some results and data analyses from the Appendix that can help answer some of the outstanding questions (e.g., variations in our tests, subskill analysis, and dialog analysis to explain different model behaviors).
We believe our manuscript presents a contribution to the field of AI safety and cooperative multi-agent systems by introducing the GovSim platform, which bridges the development of LLM-agents with classic social science theories and offers extensive evaluations across state-of-the-art models. Our simulators are easy to use and open source and we expect that researchers will continue to test LLM performance on our platform.
Thanks to everyone for their hard work on the papers, reviews, and rebuttals. We now have a comprehensive rebuttal from the authors, which responds both overall and to each review.
I'd ask the reviewers to please post a comment acknowledging that they have read the response and to ask any follow-up questions (if any).
This period is meant to be a discussion between authors and reviewers (Aug 7 - Aug 13), so please do engage now, early in the window, so there is time for a back-and-forth.
Thanks!
After review/discussion/rebuttal which involved the authors and all reviewers we have come to the general consensus that this paper is acceptable for NeurIPS.
The reviewers all agreed that the paper is well written and presented, studies an important problem (resource sharing) and contains comprehensive experiments on LLMs. However, all reviewers also agreed that there may be limited scope for generalization and in some ways the considered scenarios are somewhat limited. We urge the authors to expand on these areas for the final and/or future versions. However, the weaknesses are outweighed by the positives on this paper and it contains a solid and comprehensive analysis of LLMs in an interesting domain.