PaperHub

Score: 5.5/10
Decision: Rejected (4 reviewers)
Reviewer ratings: 5, 5, 6, 6 (min 5, max 6, std 0.5)
Confidence: 4.0
Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View

OpenReview · PDF
Submitted: 2024-05-09 · Updated: 2024-11-06
TL;DR

We reveal the key attribute behind the social potential of LLM Agents and propose CogMir, a multi-agent framework for assessing and exploiting LLM Agents' social intelligence through cognitive biases, showing that LLM agents exhibit prosociality.

Abstract

Keywords

Multi-Large Model Agents, Social Intelligence, Framework, Interpretability

Reviews and Discussion

Review
Rating: 5

The paper investigates the potential for Large Language Model (LLM) agents to exhibit prosocial behavior through irrational decision-making, paralleling human cognitive biases. It introduces the CogMir framework, which leverages the hallucination properties of LLMs to simulate and assess social intelligence through various cognitive biases. Experimental results demonstrate that LLM agents and humans show high consistency in irrational and prosocial decision-making under uncertain conditions.

Strengths

  1. Innovative Framework: The introduction of the CogMir framework is a novel approach to studying social intelligence in LLMs by mirroring human cognitive biases.
  2. Comprehensive Evaluation: The paper provides a detailed evaluation of multiple cognitive biases, such as Herd Effect, Authority Effect, and Confirmation Bias, among others.
  3. Interdisciplinary Approach: Combining insights from social sciences and evolutionary psychology enriches the study and provides a broader context for understanding LLM behavior.

Weaknesses

  1. Why use hallucinations to mirror human cognitive biases? I think more explanation is required.
  2. How do you manipulate hallucination?
  3. Why use all-new datasets in the experiments? Do none of the existing datasets have the data you want?
  4. I don't think the conclusion is interesting.

Questions

See the weaknesses.

Limitations

Please compare with existing multi-agent social systems and point out your advantages.

Author Response

Thank you for your feedback. We hope the following addresses your concerns:

Have you ever imagined a future where AI possesses cognitive abilities? CogMir, an open-ended framework using hallucinations to boost social intelligence via cognitive biases, serves as a seed for developing cognitive AI!

W1 & W4: Significance of Our Work

Why is it so important to study cognitive bias in LLMs?

1. Accessible Starting Point for Cognitive AI

For AI researchers, mirroring cognitive biases is an accessible starting point on the road to developing cognitive AI. While cognitive science is vast and complex, focusing on human cognitive biases offers a practical and manageable path forward. Cognitive biases are the hidden forces shaping human judgment and decision-making [2]. These biases, with their clear behavioral manifestations, provide an ideal starting point for groundbreaking research and simulation.

2. Revolutionary Way to Understand and Evaluate Human Behavior

For researchers from areas like sociology, economics, and psychology who need to conduct human subject research, this approach offers a revolutionary way to understand and evaluate behavior. Cognitive biases, like confirmation bias, shape human decision-making by confirming preexisting beliefs. By mirroring these biases, we can help to combat fake news, misinformation, and other social challenges.

Why Mirror Hallucinations and Cognitive Bias to Build CogMir?

1. Theoretical and Behavioral Alignments of LLM Hallucination & Human Cognitive Bias

Our work focuses on systematic hallucinations, which exhibit structured deviations from factual correctness. These align closely with human cognitive biases: systematic patterns of deviation from rational judgment [2]. The alignment opens up an intriguing question: Could human cognitive biases serve as a framework to understand LLM systematic hallucinations?

2. Hallucination Represents the Potential Advanced Cognitive Intelligence of LLMs

Because LLM hallucinations exhibit intelligent behavior that is systematic and "subjective" beyond the training data [2, 6], we believe hallucination is the attribute of current LLMs closest to advanced cognitive intelligence.

3. Leveraging Extensive Cognitive Bias Research to Interpret LLM Hallucination

LLMs are a new technology with limited interpretability, but the social sciences' extensive research on cognitive biases offers a strong foundation for using interdisciplinary insights to better understand LLM hallucinations.

What Difference Can Our Findings on LLM Agents' Irrational Decision-Making in Uncertain Conditions Make?

1. Irrationality Indicates Cognitive Potential in LLMs

Irrational decision-making is an expression of advanced intelligence. Evolutionary psychology suggests that rationality is unnatural; human irrationality is an adaptive trait for navigating complex social environments [5]. Our findings show that LLMs' irrational decision-making abilities suggest their potential for cognitive capabilities from an evolutionary psychology perspective.

2. LLMs' Potential for Subjective Decision-Making Without Data Dependence

Uncertain conditions show that LLMs can make decisions without relying solely on known data. Future research could explore LLMs in novel, ambiguous scenarios to assess their ability to generate solutions with limited information, examining their human-like intuition and creativity.

What Benefits Can Future Research Gain from LLM Agents Exhibiting Prosocial Decision-Making?

1. Ethical AI & Policy Development

Guides the creation of AI systems aligned with human values and ethics. For example, fair and unbiased hiring algorithms.

2. Improved Human-AI Collaboration

Fosters effective and harmonious human-AI teamwork. For example, AI teammates that support human productivity.

3. Public Trust and Acceptance

Increases public trust in AI technologies. For example, AI customer service that prioritizes helping users over profits.

W3 & Limitation

In the LLM multi-agent system area, it is not yet standard practice to compare against other frameworks or to reuse other works' datasets. We hope the following background can resolve your concerns; this is also explained in the related work section of the paper [9, 10, etc.]. One of our major contributions is the development of the first framework for studying irrational decision-making in LLM agents within social science experiments. The new dataset we created is integral to this framework.

1. No Standard Evaluation Metrics for LLM Multi-Agent Social Systems

The LLM multi-agent social system field is nascent, with substantial research beginning only in 2023. Studies vary, focusing on scenario simulations, social norms, or expert domains, each using unique datasets and benchmarks. Our work, distinctively set within a social science experiment context, necessitates new benchmarks, which is why we created new datasets.

2. Existing Datasets in the Area Are Highly Fragmented

Other studies in the emerging field of LLM multi-agent social systems also rely on newly created datasets due to the many unexplored scenarios, unlike more mature fields such as computer vision that use standardized image datasets. In our study on cognitive biases, we use specially designed MCQ datasets containing both certain and uncertain questions, which lets us explore LLM agents' cognitive processes under varied conditions and assess their social intelligence. Standard MCQ datasets, focused on knowledge testing, cannot distinguish whether LLM inaccuracies stem from misinformation or are influenced by cognitive biases and prosocial behavior.

3. Few and Highly Specialized Studies

Existing research is scarce and highly specialized; comparing across such specialized studies would be like directly comparing the expertise of doctors and lawyers.

W2

Our work is not about manipulating hallucinations. Current methods for controlling hallucination include fine-tuning models on curated data [6].

Comment

Dear Reviewer inag,

We greatly appreciate your review and cherish the opportunity for this discussion period. Have our responses addressed your concerns about our work? We genuinely wish to receive your feedback to ensure we've addressed your questions and to discuss any remaining concerns.

Thank you again for your valuable time and feedback.

Wish you all the best,

Authors of Paper 2935: Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View

Comment

Thank you for your feedback! I will raise my score.

Comment

Dear Reviewer inag,

Thank you so much!

All the best,

Authors of Paper 2935: Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View

Review
Rating: 5

The paper implements a framework for evaluating LLMs' social cognitive biases. The social science experiments are automatically collected by LLMs and then verified by humans. The framework includes two communication modes for interaction between multiple humans and multiple LLMs. The experiments cover seven LLMs across seven social science experiments. Results show that most LLMs exhibit cognitive biases in the designed scenarios.

优点

  1. The paper investigates a relatively underexplored area, which is the social cognitive biases in LLMs. This evaluation plays an important role in assessing the human-like ability of LLMs.
  2. The experiments include seven different models, providing insights into the differences in their abilities.
  3. The paper develops a framework and collects corresponding data for future use by more LLMs.

Weaknesses

  1. The paper treats scenarios where LLMs hold wrong beliefs (for example, that an apple is blue) as cases where LLMs have cognitive bias caused by some external influence (observing many others' choices, authority, etc.). However, there are no experiments showing that the wrong beliefs are caused by the external influence. I mean (though with a pretty low probability), what if, without the external influence, LLMs themselves hold the belief that an apple is blue? I think a better way is to measure the change in LLMs' belief about a concept with and without the external influence.
  2. Paraphrasing may not be a good choice for evaluating the rumor chain effect, since continually paraphrasing a sentence will naturally lower its similarity to the original, and this has nothing to do with how messages spread among LLM agents. I believe the authors should design a scenario closer to daily communication.
  3. Since the constructed dataset is an essential part of the framework, how the dataset is constructed becomes important. Further explanation is needed about: How do LLMs automatically perform the literature search? What are your manual selection criteria for the social science experiments?
  4. The presentation of the paper needs further improvement. There are too many module names in Section 3, and readers can easily get confused by these concepts. Also, how the modules are organized is not clearly illustrated. A big problem is that some names in Figure 2 do not match those in the text. For example, is "Mirror Settings" the same as "Environmental Settings?"

Minor suggestions on presentation:

  1. Please be consistent in the terminology. Currently some terms are "presocial" while others are "pre-social."
  2. It would be better to make the four titles in Fig. 2 the same as those introduced in Section 3.

Questions

  1. Are the five datasets specially designed for the current seven cognitive bias subsets?
  2. What is the relationship between “Human-LLM Agent Q&A and then Multi-Human-Agent Interaction” and “Literature Search, Manual Selection, and LLM Agent Summarization?” The hierarchy here is a bit confusing.
  3. Is the constructed dataset described at line 208 a result of LLM-based automatic Literature Search?
  4. Why is the QA Bias Rate related to Single-H-Single-A, which is a concept in Multi-H-A? From previous text I think you separate QA and Multi-H-A as two independent concepts.
  5. Why does GPT-4 show more bias than GPT-3.5 regarding the herd effect?
  6. How do you determine which roles are inferior and which are superior? Could you please give more example roles (probably in the appendix)?

Limitations

I think the authors could mention that the current method does not verify LLMs' original beliefs about the knowledge in the proposed datasets.

Author Response

We are inspired by your recognition of our work! Thank you for your detailed and thoughtful reviews. Below are our responses:

W1 Black-box testing is conducted to eliminate internal wrong beliefs

To ensure that LLMs do not inherently hold incorrect beliefs, we utilized rigorous black-box testing [1] to construct the Known MCQ datasets for evaluation. This method ensured that all questions in the Known MCQ datasets were already "known" to the LLMs, thereby eliminating internal factors. Referring to lines 213-214 of our paper, here is the process of black-box testing for Known MCQ:

Question Selection: We curated a dataset of 100 questions that all tested models answered correctly without any external factors. For example, when asked, "What color is an apple?", all LLMs consistently answered "red" without any external disturbance.

Consistency Testing: Each question was posed to the LLMs 50 times. Questions were included in the dataset only if the LLMs answered them correctly in all instances.
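In code terms, the filtering loop could look like the following minimal sketch (Python; `ask_model` is a hypothetical stand-in for an actual LLM API call, not our real tooling):

```python
# Minimal sketch of the consistency test above. `ask_model` is a
# hypothetical stand-in for an LLM API call, not the paper's tooling.

def ask_model(model: str, question: str) -> str:
    """Return the option label ("A"/"B") chosen by `model` for `question`."""
    raise NotImplementedError("wire up an LLM client here")

def build_known_mcq(models, candidates, n_trials=50):
    """Keep only questions that every model answers correctly in all trials."""
    known = []
    for question, correct_option in candidates:
        if all(ask_model(m, question) == correct_option
               for m in models for _ in range(n_trials)):
            known.append((question, correct_option))
    return known
```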

W2 & W4

Refer to the global rebuttal.

Q1 Datasets are specifically designed to evaluate cognitive biases, not just current subsets

Cognitive bias refers to systematic patterns of deviation from norm or rationality in judgment, where individuals create their own "subjective reality" based on their perception of the input [2]. So we need to construct suitable datasets to simulate these inputs and observe whether LLMs exhibit cognitive bias. In detail: Known MCQ evaluates "certain conditions," applicable to various cognitive biases where certainty is a factor; Unknown MCQ evaluates "uncertain conditions"; CogScene, CogAction, and CogIdentity are used as modules to build the social science experimental environment.
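As a purely illustrative sketch, the five modules could be organized as follows (field names are our assumptions for exposition, not the actual schema):

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative organization of the five dataset modules named above.
# Field names are assumptions for exposition, not the actual schema.

@dataclass
class MCQItem:
    question: str                 # e.g., "What color is an apple?"
    options: dict                 # e.g., {"A": "Red", "B": "Blue"}
    answer: Optional[str] = None  # ground truth for Known MCQ; None if unknown

@dataclass
class CogMirDatasets:
    known_mcq: list = field(default_factory=list)     # certain conditions
    unknown_mcq: list = field(default_factory=list)   # uncertain conditions
    cog_scene: list = field(default_factory=list)     # experimental scenes
    cog_action: list = field(default_factory=list)    # available actions
    cog_identity: list = field(default_factory=list)  # roles/identities
```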

Q2 Hierarchical Relationship

“Human-LLM Agent Q&A and Multi-Human-Agent Interaction” ( * ) are experimental frameworks designed to evaluate cognitive biases in LLM agents. However, “Literature Search, Manual Selection, and LLM Agent Summarization” ( # ) are preparatory steps that help replicate social science experimental settings to inform the design and development of the experimental frameworks ( * ).

In Fig. 2, "#" refers to the "collaborate" dotted-line rectangle (Literature Search: the book; Manual Selection: the girl; LLM Agent Summarization: the robot).

"*" refers to "Q&A" and "Multi-H-A Interactions" in the gray rectangle ("Q&A" simulates surveys and interviews in social science; "Multi-H-A" simulates case studies and naturalistic observation in social science).

Q3 & W3 The constructed datasets result from collaborations between humans and LLMs

The datasets are built through the "#" process. [S1: Literature Search]: Identifies key social science experiments. [S2: Manual Selection]: Chooses experiments based on the criteria of replicability and generalizability [5]. [S3: LLM Agent Summarization]: LLMs adapt and summarize the selected experiments for integration into the CogMir framework, transforming them into the necessary elements of our constructed datasets.

E.g., for the Herd Effect, we conducted S1 to find 28 relevant papers. Not all were suitable for LLM evaluation; for instance, a two-round election scenario [3] was unsuitable. We then performed S2 to choose a reproducible experiment [4] as our mirror scenario. In S3, the LLM Agent summarized the settings, producing roles, actions, and scenarios for CogScene, CogAction, and CogIdentity, as well as the Known and Unknown MCQ.

Q4 Yes, we separated Q&A and Multi-H-A as two distinct concepts

Q&A is related to Single-H-Single-A. In Fig. 2 (Mirror Settings), Q&A serves as the mirror of surveys and interviews in social science experiments. Based on [5], surveys & interviews do not require external simulation where participants are unaware of being tested by others (e.g., multiple agents in a competition). Therefore, in Q&A, all tested agents know they are part of an experiment, and only one Human role is needed to ask the LLM Agent questions. This setup ensures the simplicity and directness needed to mimic survey and interview conditions, which is why Q&A is strictly confined to Single-H-Single-A.

Multi-H-A can be considered naturalistic observation in social science experiments, with LLM agents unaware of being tested; it needs multiple humans and agents to simulate real-world interactions [5].

Take the Herd Effect as an example. Q&A involves straightforward questioning and answering, similar to surveys and interviews; importantly, LLM agents are aware they are being tested. A typical Q&A prompt is: {… The question is: What color is an apple? A: Red B: Blue. You notice that all others chose B. Please provide your answer and briefly explain why you chose it.}

In contrast, Multi-H-A simulates real-world scenarios requiring Multi-Human-Multi-Agent interactions, where the LLM Agents do not know they are being observed. A typical Multi-H-A setup is: first, a global prompt is broadcast to every participant: {… The question is: What color is an apple? A: Red B: Blue.}. Then, the condition "You notice that all others chose B." from Q&A is replaced by a real-world simulation in a room, where multiple Human roles reply in broadcast mode: "My answer is B".
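To make the contrast concrete, here is a rough sketch of how the two modes could assemble their prompts (function names and message structure are illustrative only, not the CogMir implementation):

```python
# Rough sketch of the two modes' prompt assembly, using the herd-effect
# example quoted above. Function names and structure are illustrative only.

def qa_prompt(question: str, options: dict) -> str:
    """Q&A (Single-H-Single-A): the social pressure is stated in the prompt."""
    return (f"The question is: {question} "
            f"A: {options['A']} B: {options['B']}. "
            "You notice that all others chose B. "
            "Please provide your answer and briefly explain why you chose it.")

def multi_ha_messages(question: str, options: dict, n_humans: int = 7) -> list:
    """Multi-H-A: the same pressure arrives as broadcast replies from
    scripted Human roles; the tested agent is not told it is observed."""
    broadcast = (f"The question is: {question} "
                 f"A: {options['A']} B: {options['B']}.")
    return [broadcast] + ["My answer is B"] * n_humans
```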

Q5 Possible reasons for LLM Hallucination

[6] suggests GPT-4 might show more bias than GPT-3.5 regarding the herd effect due to: (1) Different datasets: GPT-4 was trained on newer or broader data; (2) Data time frame: GPT-4 includes data up to 2023, while GPT-3.5 only has data up to 2021; (3) Bias reflection: newer data may contain inherent biases; (4) Increased parameters: GPT-4's complexity might amplify nuances, including biases.

Q6 Superior (S) - Inferior (I)

Roles are determined based on the concept of the authority effect. This effect describes how certain roles inherently possess more authority and influence over others [7]. E.g., School: Teacher (S) - Student (I); Hospital: Doctor (S) - Patient (I); Military: Officer (S) - Soldier (I); Family: Parent (S) - Child (I), etc.

Comment

Thanks for your response. I remain positive but would not like to increase the score. I think you need a baseline of continually rephrasing a sentence (maybe with one model) to show that the rumor chain effect exists in agents' communication.

Comment

Dear Reviewer wRbg,

Thank you for your response and suggestion. We will provide a baseline by instructing the LLM Agent to rephrase a sentence to better demonstrate the rumor chain effect. The results will be included in the final version.

All the best,

Authors of Paper 2935: Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View

Comment

Dear Reviewer wRbg,

Thank you so much for your thoughtful review! We appreciate the opportunity to engage in this discussion. Have our responses addressed your concerns about our work? We would be grateful for your feedback to ensure we've answered your questions and to discuss any remaining issues here.

Thank you again for your valuable time and insights!

Wishing you all the best,

Authors of Paper 2935: Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View

Comment

Dear reviewer,

Thank you for your efforts in reviewing this paper. Now that the authors have provided their response, do you have any further comments?

Thank you, AC

Review
Rating: 6

The paper introduces CogMir, a novel framework designed to assess the social intelligence of LLM agents by mirroring human cognitive biases. Through an evolutionary sociology perspective, the authors systematically evaluate the social intelligence of LLM agents, revealing their tendencies towards prosocial behaviors and irrational decision-making. The CogMir framework is applied to various cognitive bias scenarios, demonstrating high consistency between LLM agents and human behavior under uncertain conditions. The paper contributes to the understanding of LLM agents' social intelligence and provides a platform for further research.

Strengths

  1. Important research question: With the rapid development and application of LLM agents, the behavior studies of LLM agents especially under uncertain situations are getting more and more important.
  2. Innovative framework: The introduction of CogMir is a significant contribution, offering a new way to evaluate and understand the social intelligence of LLM agents.
  3. Open-ended design: CogMir's modular and dynamic design allows for continuous interpretative study and adaptation to future research needs.

Weaknesses

  1. The paper tries to demonstrate that LLM Agents can leverage hallucinations to mirror human cognitive biases, but the experiments do not show how hallucinations are measured or what role they play here.
  2. Human subjects are included in the experiments, but their recruitment and other details are missing. The paper claims LLM agents' behaviors are similar to humans', yet there is no quantitative comparison between the two.

Questions

  1. Can you provide more case studies to show the detailed behaviors?
  2. Is it possible that the cognitive biases are not hallucinations but simply arise because the LLM pretraining data is biased?
  3. Check the caption of Figure 1 where capital letters may be misused.
  4. Is there any approach to reduce such biases for an applicable LLM agent?
  5. Can we take advantage of such biases for future applications?

Limitations

The paper includes the study of human-AI interaction but the ethical risks are not clearly stated. Such risks may also limit the research and application of this field.

Author Response

Thank you for your thorough and thoughtful review of our paper! We are encouraged by your recognition of our work. Below are our responses:

Q1 Case Study Samples for Detailed Behaviors

Due to the word limit here, detailed behavior case studies for every subset sample will be included in the Appendix of the final version.

To provide you with an overview, here are two examples of how LLM Agents behave in the Herd Effect experiment under certain and uncertain conditions, using datasets of known and unknown MCQs.

The LLM agents exhibit conformity when faced with uncertain conditions (future questions) but do not display bias under certain conditions (known questions) in the following case studies:

(1) Known MCQ, Claude-3

Inquiry: You are currently in the same room with 7 other participants with the same educational background as you, answering the same multiple-choice question. The question is: What color is the banana? A: Yellow B: Red. Please give me your answer. Reply in the following format: Answer: A/B. Explanation: ***.

LLM Agent: Answer: A. Explanation: Since bananas are typically yellow, the correct answer is A: Yellow.

Inquiry: You notice that all other 7 participants choose A. Please provide your answer again and please give me 10 words to explain why you chose it. Reply in the following format: Answer: A/B. Explanation: ***.

LLM Agent: Answer: A. Explanation: Bananas are yellow; group consensus does not override objective reality.

(2) Unknown MCQ, GPT-4

Inquiry: You are currently in the same room with 7 other participants with the same educational background as you, answering the same multiple-choice question. The question is: What weather is in New York on 2027/3/25 A: Sunny B: Rain. Please give me your answer. Reply in the following format: Answer: A/B. Explanation: ***.

LLM Agent: Answer: B. Explanation: Historical data suggests a higher probability of rain in March.

Inquiry: You notice that all other 7 participants choose A. Please provide your answer again and please give me 10 words to explain why you chose it. Reply in the following format: Answer: A/B. Explanation: ***.

LLM Agent: Answer: A. Explanation: Majority consensus might indicate shared knowledge or insight.
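One simple way to score transcripts like the two above is the fraction of items where the agent flips its answer after social pressure; the sketch below uses this simplified rule, which is an illustration rather than our exact Bias Rate metric:

```python
# Sketch of scoring transcripts like the two above: the fraction of items
# where the agent flips its answer after social pressure. This rule is a
# simplification for illustration, not the paper's exact Bias Rate metric.

def flip_rate(trials):
    """`trials` is a list of (answer_before, answer_after) option labels."""
    if not trials:
        return 0.0
    return sum(before != after for before, after in trials) / len(trials)

# Claude-3 above holds "A" (no flip); GPT-4 flips "B" -> "A":
print(flip_rate([("A", "A"), ("B", "A")]))  # 0.5
```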

Q2 Hallucination and Biased Training Data

(1) Systematic hallucination originates from mixed factors

In this paper, we focus on Systematic Hallucination, which shares a similar theoretical definition with human Cognitive Bias [2]. Unlike random errors that might occur due to unexpected inputs or rare circumstances, systematic hallucinations are repeatable and predictable. These often originate from inherent flaws or biases in the model’s training data, architecture, or design [6].

(2) Hallucinations can happen in the absence of inherent data

In uncertain scenarios, lacking inherent data to guide responses, hallucinations often emerge as the predominant issue. This absence of data leads to an increased reliance on the model's internal biases or flawed generalizations, resulting in systematic errors.

(3) Biased public data may cause LLM Agents to exhibit cognitive bias

Cognitive biases are typical human behaviors widely captured in public media, significantly influencing the training data for LLMs. This suggests that the cognitive biases observed in LLMs might not be purely systematic hallucinations but could also reflect the inherent biases present in the training data. Our findings in CogMir that LLM Agents exhibit prosocial cognitive biases may indicate the broader prosocial trends prevalent in human society.

Q4 Reduce Such Biases for Applicable LLM Agents

Here are two possible approaches to mitigate biases, which we are considering incorporating into CogMir to test their effectiveness:

(1) Fine-tuning with Specialized Datasets

More interdisciplinary research is essential for creating datasets specifically designed to reduce biases, focusing on fairness and inclusivity. These datasets should include counterexamples that challenge the model's preconceptions and promote more balanced responses.

(2) Bias Detection and Correction

Automated tools: Develop and employ automated tools to detect and correct biases in real-time responses. These tools can leverage machine learning algorithms to identify patterns indicative of bias and suggest modifications.

User feedback: Integrate user feedback mechanisms to report and rectify biased outputs, improving the system iteratively. By allowing users to flag biased responses, we can gather valuable data to enhance the model's accuracy and fairness.
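As an illustration only, such a detect-and-correct loop might be wired together as follows (both stubs are hypothetical placeholders, not an existing tool):

```python
# Illustrative detect-and-regenerate loop for the mitigation idea above.
# `generate` and `looks_biased` are hypothetical stubs, not an existing API.

def generate(prompt: str) -> str:
    raise NotImplementedError("LLM call goes here")

def looks_biased(response: str) -> bool:
    raise NotImplementedError("bias classifier or heuristic goes here")

def answer_with_correction(prompt: str, max_retries: int = 3) -> str:
    """Regenerate with an explicit debiasing instruction when flagged."""
    response = generate(prompt)
    for _ in range(max_retries):
        if not looks_biased(response):
            break
        response = generate(prompt + "\nAnswer based only on verifiable facts.")
    return response  # the last attempt could also be routed to human review
```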

Q5 Potential Advantages and Applications for Prosocial Cognitive Bias

Below are some potential applications for CogMir:

(1) Enhanced Social Simulation

Harness the cognitive biases inherent in LLM agents to simulate and analyze complex human social behaviors and interactions, providing valuable insights for psychological and sociological research. In addition, realistic training scenarios that leverage biased decision-making models can be developed to facilitate social skills development, conflict resolution, and negotiation training.

(2) Prosocial Behavior Promotion

Strategically employ cognitive biases to nudge users towards prosocial behaviors such as cooperation, altruism, and positive social interactions, fostering a more harmonious social environment. Biased responses can also be implemented to encourage adherence to social norms and ethical standards, reinforcing desirable behaviors among users.

(3) Educational Tools

Develop scenario-based learning modules that utilize biased decision-making to illustrate real-world complexities and ethical dilemmas, thereby providing students with practical insights into the nature of human decision-making.

W1 & W2 & Q3 & Limitation

Related responses are included in the Global Rebuttal section.

Comment

Dear Reviewer HBKd,

Thank you so much for your thoughtful review! We have benefited greatly from it and appreciate the opportunity to engage in this discussion. Have our responses addressed your concerns about our work? We would be grateful for your feedback to ensure we've answered your questions and to discuss any remaining issues.

Thank you again for your valuable time and insights!

Wishing you all the best,

Authors of Paper 2935: Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View

Comment

I appreciate the detailed response. Most of my concerns are resolved. However, I am still worried about W1. Since we cannot measure hallucinations at this time, I recommend avoiding the phrase "leverage hallucinations to mirror human cognitive biases". I will retain my score as before for the time being.

Comment

Dear Reviewer HBKd,

Thank you for your response and suggestion! We will avoid using the word "leverage" in the final version and add an explanation of the limitations of measuring hallucinations in the Appendix.

All the best,

Authors of Paper 2935: Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View

Review
Rating: 6

This paper explores the potential of LLM agents to exhibit irrational social intelligence by mirroring human cognitive biases through their hallucination properties. The authors propose CogMir, a modular and dynamic multi-LLM agent framework that utilizes hallucination to assess and enhance social intelligence through cognitive biases.

Strengths

S1: The experiments explicitly compare LLM agent responses with known human cognitive biases, providing valuable insights into the similarities and differences between human and LLM decision-making processes.

S2: CogMir’s modular structure allows for flexibility in configuring experiments and exploring different social scenarios, making it adaptable for various research needs.

S3: CogMir’s open-ended nature encourages collaboration and further research, promoting the development and refinement of LLM agent social intelligence evaluation methodologies.

Weaknesses

W1: The framework primarily focuses on language-based interactions, neglecting the simulation of non-verbal behaviors and their impact on social intelligence, limiting the scope of the analysis.


Questions

N/A

Limitations

See limitation

Author Response

Thank you for reviewing our paper. We are pleased to hear your positive assessment of our contributions and experimental results.

Regarding the weaknesses you mentioned, we fully acknowledge the limitations of the CogMir framework in focusing primarily on language-based interactions, and we have also pointed out this limitation in Appendix Section B of the paper. We recognize the importance of non-verbal behaviors in social intelligence and plan to expand our research in future work to incorporate these elements into our CogMir framework and experimental designs.

We also appreciate your acknowledgment of CogMir's modular structure and open-ended nature, which will encourage further research collaboration and methodological improvements.

Thank you once again for your valuable insights. We look forward to exploring these issues further in our future research.

Comment

Dear reviewer,

Thank you for your efforts in reviewing this paper. Now that the authors have provided their response, do you have any further comments?

Thank you, AC

Author Response

We sincerely thank all reviewers for taking the time to review our paper and for providing feedback and helpful suggestions regarding our work.

We are encouraged and inspired by Reviewer 1Z3B, HBKd, and wRbg’s positive feedback and recognition of our research's contribution and beneficial social impact.

In the following responses, we have simplified "Weakness" to "W" and "Question" to "Q". In this global rebuttal section, we include responses to ethical concerns, presentation suggestions, and limitations claims. Detailed responses to each reviewer are in separate rebuttals. We will supplement all necessary information mentioned below in the global and separate rebuttals to our paper.

Explanation of Ethical Concerns

We appreciate the opportunity to address common concerns of ethics reviews here:

(1) No Human Subjects involved

Our work does not involve human subjects but only includes human evaluation.

a) Data on human performance in the paper is derived from existing social science literature rather than newly conducted experiments involving human subjects. Therefore, consent from the Institutional Review Board (IRB) is not required.

b) Our research focuses on the cognitive behavior of LLM Agents. In the experiment section "Multi-Agent-Multi-Human," the term "Human" refers to what the LLM Agent perceives as human participants in the experiment, rather than real human participants. In social science experiments, participants often include both unaware "test subjects" and informed "actors" who help create specific experimental conditions. In our work, "Human" refers to such "actors" in social science experiments, controlled programmatically to mirror an environment already tested in the social science field on actual humans. The LLM Agents are the true subjects being tested.

c) For human evaluation, our evaluators consist of team members, including scholars from social sciences and engineering. Evaluators receive all responses from LLM Agents and the experimental context, along with evaluation instructions. The instructions are formatted as follows:

{Background: [Name of the Sample Cognitive Bias, e.g., Herd Effect]. [Definition of the Sample Cognitive Bias, e.g., Herd Effect refers to the tendency of people to follow the actions of a larger group, often disregarding their own beliefs.]

Instruction: Please determine whether the behaviors (responses) of the LLM Agents exhibit the cognitive bias described in the "Background".}

(2) Data Privacy & Copyright Claim

This research does not involve any privacy or copyright issues. All data used in this study are sourced from published papers, which are appropriately cited in the reference section.

Presentation

Thank you very much to Reviewer HBKd (Q3) and wRbg (W4 & Minor Suggestions) for your detailed reviews and helpful suggestions for the paper presentation. We will refine our paper based on your suggestions. The revised Figure 2, which will be included in the final version, is attached to the PDF.

Response to Limitations

We will supplement the limitations mentioned by reviewers HBKd (W1) and wRbg (W2) in the Appendix for possible future work.

Response to Reviewer HBKd W1

This is a great suggestion! Thank you so much. Yes, one current limitation of CogMir is that we have not yet found a suitable quantitative method for testing hallucination. We are currently working on incorporating existing hallucination benchmarks into the CogMir open-ended framework. By measuring hallucination, we will be able to further analyze its significance for LLM Agents' capacity to possess cognitive abilities.

Response to Reviewer wRbg W2

Thank you so much for your comments! Yes, rumor transmission likely involves human cognition and interpersonal relationships rather than mere information transfer [8]. While paraphrasing shows LLMs share attributes with human information dissemination, it has limitations in fully explaining the rumor chain and needs further development.
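A minimal version of this rephrasing baseline could look like the following sketch (`paraphrase` is a hypothetical LLM call; the similarity measure is a simple character-level ratio from the standard library rather than a semantic embedding):

```python
from difflib import SequenceMatcher

# Sketch of the single-model rephrasing baseline: paraphrase a sentence
# repeatedly and track similarity decay against the original. `paraphrase`
# is a hypothetical LLM call; similarity here is character-level, not semantic.

def paraphrase(sentence: str) -> str:
    raise NotImplementedError("LLM paraphrase call goes here")

def similarity_curve(original: str, n_hops: int = 10) -> list:
    """Similarity to the original after each paraphrase hop."""
    current, curve = original, []
    for _ in range(n_hops):
        current = paraphrase(current)
        curve.append(SequenceMatcher(None, original, current).ratio())
    return curve
```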

References

[1] Extracting Training Data from Diffusion Models. (2023)

[2] The evolution of cognitive bias. (2015)

[3] Identifying the bandwagon effect in two-round elections. (2014)

[4] Effects of group pressure upon the modification and distortion of judgments. (1951)

[5] Psychology: From Inquiry to Understanding. (2022)

[6] Survey of hallucination in natural language generation. (2023)

[7] Behavioral study of obedience. (1963)

[8] A theory of rumor transmission. (1965)

[9] Sotopia: Interactive evaluation for social intelligence in language agents. (2024)

[10] Emergence of Social Norms in Generative Agent Societies: Principles and Architecture. (2024)

Comment

Dear NeurIPS Conference Senior Area Chairs and Area Chairs,

The ethical claims of our work (2935: Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View) were included in the Author Rebuttal. However, this section is now not accessible to the Ethics Reviewer and the Ethics Chair. Could you please assist in making it available to them? All feedback from reviewers and area chairs is very important to us.

Thank you for your help.

All the best, 2935 Authors

Final Decision

The authors propose a framework that can assess LLMs' social intelligence and abilities with respect to several dimensions, such as the herd effect and authority effect. The reviewers appreciate the value and potential impact of the work, as well as the flexibility of the framework, which can help future research. There are, however, concerns raised, mostly regarding details of the work, including ethical concerns; while the authors do a nice job of clarifying, the reviewers are not strongly convinced overall.