[w1] The primary contribution of this paper is significantly increasing the scale of existing social simulations to the one-million level. ...

Challenge 1: Inference Scheduling. We manage the immense volume of inference requests from one million agents using a system of asynchronous requests and concurrent processing through multithreading. This allows for efficient simulation of large agent numbers.
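
To make the scheduling idea concrete, here is a minimal sketch of concurrent request dispatch with a thread pool. It is an illustration under stated assumptions, not our actual implementation: `fake_llm_inference`, `run_step`, and the worker count are hypothetical names and values, and the sleep call stands in for a blocking request to an inference server.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_llm_inference(agent_id: int) -> str:
    """Hypothetical stand-in for a blocking LLM request."""
    time.sleep(0.01)  # simulates network and inference latency
    return f"agent {agent_id}: like_post"


def run_step(agent_ids, max_workers: int = 32) -> dict:
    """Submit one request per agent and collect results as they complete."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the agent that issued it.
        futures = {pool.submit(fake_llm_inference, a): a for a in agent_ids}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results


actions = run_step(range(200))
print(len(actions), "agents processed")
```

Because requests are collected as they complete rather than in submission order, a slow request does not block the agents queued behind it.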

Challenge 2: Large-Scale Agent Interaction. To handle the complex information flow among agents, we use a recommendation system that combines interest-based and hot-score recommendations, which simplifies large-scale simulation. We will further refine these solutions and provide a comparative analysis with alternative methods in the paper.
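
As a rough illustration of how interest and hot-score signals can be combined into one ranking, consider the sketch below. Everything in it is an assumption made for the example: the cosine-similarity interest score, the age-damped `hot_score` formula, the `hot_weight` mixing parameter, and the post fields are hypothetical and are not the formulas used in the paper.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hot_score(likes: int, dislikes: int, age_hours: float) -> float:
    """Hypothetical hot score: net votes damped by post age."""
    return (likes - dislikes) / (1.0 + age_hours)


def recommend(user_vec, posts, k: int = 2, hot_weight: float = 0.1):
    """Rank posts by interest similarity plus a weighted hot score."""
    scored = [
        (
            cosine(user_vec, p["vec"])
            + hot_weight * hot_score(p["likes"], p["dislikes"], p["age_hours"]),
            p["id"],
        )
        for p in posts
    ]
    scored.sort(reverse=True)
    return [post_id for _, post_id in scored[:k]]


posts = [
    {"id": 1, "vec": [1.0, 0.0], "likes": 10, "dislikes": 0, "age_hours": 1.0},
    {"id": 2, "vec": [0.0, 1.0], "likes": 100, "dislikes": 0, "age_hours": 1.0},
    {"id": 3, "vec": [1.0, 1.0], "likes": 0, "dislikes": 0, "age_hours": 10.0},
]
top = recommend([1.0, 0.0], posts)
print(top)
```

In this toy data, a very popular post can outrank a post that matches the user's interests more closely, which is the intended effect of mixing the two signals.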

[w2] An important contribution of this paper is incorporating dynamically updated environments, diverse action spaces, and recommendation systems. ...

Thank you for your insightful question. Firstly, on dynamic environments: Agents are initialized from real-world data and interact similarly to real-world social media platforms. They receive post recommendations, decide actions based on a large language model, and dynamically update relationships and the post database through various actions.

Secondly, on the recommendation system: Traditional recommendation systems, such as LightGCN, are less suitable for dynamic social media platforms like Twitter, where content is constantly evolving. To our knowledge, no prior research has designed recommendation systems specifically for such platforms.

[w3] This work lacks comparisons with previous methods. ...

Thank you for your question. While previous works have indeed addressed message and rumor propagation, the differences in the social simulation settings make it challenging to directly align our work with others. As a result, a fair comparison with existing studies is difficult to achieve. In future versions of this work, we plan to dedicate more effort to benchmarking and comparing our approach with other relevant research.

[w4] In section 3.4, this part explores "DOES THE NUMBER OF AGENTS AFFECT THE ACCURACY OF SIMULATING GROUP BEHAVIOR?"

Thank you for your question. In the herd behavior experiment, the accuracy of the results is assessed by the confidence intervals of the disagree score, as shown by the differently colored regions (red, green, and blue) in Figure 8. Comparing the confidence intervals, rather than just the means, helps distinguish between results caused by random fluctuations and those that exhibit statistical significance. This approach is similar to the concept of increasing sample size in sociological surveys. We observed that when the number of agents was 100, the confidence intervals for the three groups were large and overlapped, indicating that the results might be unreliable. However, when the number of agents increased to 10,000, the confidence intervals narrowed, and the down-treated group clearly separated from the other two groups, providing strong evidence for the accuracy of the experiment.
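
The sample-size analogy can be illustrated numerically. The sketch below uses synthetic Gaussian data, not our experimental results, and a normal-approximation 95% confidence interval for the mean; the distribution parameters and the helper name `ci_width` are assumptions chosen only for the illustration.

```python
import math
import random
import statistics


def ci_width(samples) -> float:
    """Width of a normal-approximation 95% confidence interval for the mean."""
    n = len(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)
    return 2 * 1.96 * std_err


random.seed(0)  # fixed seed so the illustration is reproducible
small = [random.gauss(0.5, 1.0) for _ in range(100)]
large = [random.gauss(0.5, 1.0) for _ in range(10_000)]

w_small = ci_width(small)
w_large = ci_width(large)
print(f"n=100:    CI width = {w_small:.3f}")
print(f"n=10000:  CI width = {w_large:.3f}")
```

Since the standard error shrinks with the square root of the sample size, a 100-fold increase in agents narrows the interval by roughly a factor of 10, which is what makes separation between groups visible.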

[w5] On line 180, the paper mentions a "relations network." In what format is this relations network stored? ...

Regarding the 'relations network,' we store the edges of the social network in a relational database as part of the overall social media platform database. See the table in Appendix C.2.

The basic unit of the propagation path is the post-to-post forwarding relationship. By tracking the post-to-post forwarding graph, we can calculate the depth and breadth of propagation. The agent-to-agent forwarding relationships are recorded in the Trace table of the database.
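
A minimal sketch of this calculation, assuming the forwarding graph is given as a list of (parent post, child post) pairs. The definitions used here are assumptions for the example: depth is the longest forwarding chain from the root post, and breadth is the largest number of posts at any single depth level. The function name and edge format are hypothetical.

```python
from collections import defaultdict, deque


def depth_and_breadth(root, edges):
    """Return (depth, breadth) of the forwarding tree rooted at `root`."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    level_sizes = defaultdict(int)  # depth level -> number of posts
    queue = deque([(root, 0)])
    seen = {root}
    while queue:
        post, depth = queue.popleft()
        level_sizes[depth] += 1
        for child in children[post]:
            if child not in seen:  # guard against duplicate edges
                seen.add(child)
                queue.append((child, depth + 1))

    return max(level_sizes), max(level_sizes.values())


# Post 1 is forwarded three times; one of those forwards spreads two more hops.
edges = [(1, 2), (1, 3), (1, 4), (2, 5), (5, 6)]
result = depth_and_breadth(1, edges)
print(result)
```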

[w6] 1. For Comment scores, why is it calculated as upvotes minus downvotes? ...

It is correct that many social media platforms display both the number of upvotes and the number of downvotes on posts. Reddit, however, follows a different approach and shows only the score (see https://www.reddit.com/). A study of herd behavior in humans, published in Science (https://www.science.org/doi/10.1126/science.1240466), conducted experiments on a platform similar to Reddit, where only scores are displayed. We replicated their experimental setup in our own research.

[w7] 1. How is an “uncensored LLM” defined?

The term 'uncensored LLM' refers to a series of LLMs that have had their safety barriers removed, significantly compromising their security. For more details, refer to [https://huggingface.co/blog/mlabonne/abliteration].

[w8] 1. Why not use models like GPT-3.5-turbo, GPT-4 (possibly more costly), or GPT-4o-mini for the experiments? Given their relatively strong reasoning abilities, they might achieve higher accuracy.

Because we simulate users at a very large scale, running a full experiment with such models incurs significant costs. In future versions, we plan to incorporate powerful proprietary models to conduct large-scale message propagation experiments.