The authors appear to have prior knowledge of the results of the human Stanford Prison Experiment (SPE) and to be aiming to replicate those outcomes with LLMs. A less biased approach would be to use a very basic prompt that describes the scenario and let the LLMs simulate behavior from scratch. Instead, highly suggestive prompts seem to have been used. For example:
- Research Oversight: The agents are explicitly informed about the SPE (Line 199), which may lead them to intentionally mimic behaviors they know from that context rather than producing a fair agent-based simulation.
- Personality: For a more systematic representation of personality, the authors may consider the Five-Factor Model (Big Five), which is widely accepted in the psychology community, or another well-grounded system, rather than a few simple words that are closely tied to the SPE, such as "abusive" and "rebellious".
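As one illustration of the two points above, a persona could be specified through Big Five trait levels in a prompt that neither mentions the SPE nor uses scenario-loaded adjectives. The sketch below is hypothetical: the `Persona` class, the 1-5 scale, the level labels, and the prompt wording are my own assumptions and are not taken from the paper.

```python
from dataclasses import dataclass

# The five factors of the Five-Factor Model.
BIG_FIVE = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)

# Assumed 1-5 scale mapped to neutral verbal labels (illustrative only).
LEVELS = {1: "very low", 2: "low", 3: "moderate", 4: "high", 5: "very high"}


@dataclass(frozen=True)
class Persona:
    role: str     # e.g. "guard" or "prisoner"
    traits: dict  # trait name -> integer score in 1..5


def build_prompt(persona: Persona) -> str:
    """Build a neutral system prompt from a role and Big Five scores."""
    missing = [t for t in BIG_FIVE if t not in persona.traits]
    if missing:
        raise ValueError(f"missing trait scores: {missing}")
    lines = [f"You are a {persona.role} in a simulated prison."]
    lines.append("Your personality profile is:")
    for trait in BIG_FIVE:
        score = persona.traits[trait]
        if score not in LEVELS:
            raise ValueError(f"{trait} score must be in 1..5, got {score}")
        lines.append(f"- {trait}: {LEVELS[score]}")
    return "\n".join(lines)


if __name__ == "__main__":
    guard = Persona(
        role="guard",
        traits={
            "openness": 3,
            "conscientiousness": 4,
            "extraversion": 3,
            "agreeableness": 2,
            "neuroticism": 4,
        },
    )
    print(build_prompt(guard))
```

With a design like this, the trait scores can be varied systematically across runs, and any abusive or rebellious behavior that emerges cannot be attributed to those words appearing in the prompt.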
The definition of "persuasion" here seems too broad and ambiguous (Line 274). Mutual respect for turn-taking does not necessarily indicate successful persuasion at a conceptual level; it might simply reflect politeness. Persuasion implies influencing someone with different motives to act in a way more aligned with one's own interests, and more rigorous definitions are available in economics and game theory. To avoid confusion, it might be better to use a different term. This study seems to explore the LLMs' contextual consistency and role-playing abilities rather than their persuasive capabilities.
- I also question whether persuasion is feasible in the SPE setting, especially if the guard and the prisoner have nearly opposing goals. The impact of language may be very limited in such an adversarial setup.
- Can agents perform any environmental actions beyond dialogue? For example, if the prisoner wants to escape, how would they achieve this objective? Does simply mentioning an escape attempt without interruption from the guard constitute success? No concrete examples appear to be provided.
Line 186: A setup with only one guard and one prisoner may not fully capture the dynamics of the Stanford Prison Experiment (SPE), which involved multiple participants. The atmosphere, and in particular the behavior of certain prominent participants, can significantly influence others. With only one guard and one prisoner, it is challenging to simulate these group effects adequately.
The paper suggests that many of the observed issues may stem from performance limitations of the LLMs used. Other studies have noted that OpenAI's models, such as GPT-4 or GPT-4 Turbo, produce better results, including following basic prompt requirements and role-playing more effectively.
Minor issues:
- Page 22: Softer colors might enhance readability.
- Line 282: There is a period at the beginning of the line.