PaperHub
Rating: 5.5/10 · Poster · 3 reviewers (scores: 2, 4, 3; min 2, max 4, std dev 0.8)
ICML 2025

CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

OpenReview · PDF
Submitted: 2025-01-20 · Updated: 2025-07-24
TL;DR

This paper studies Role-Playing Language Agents (RPLAs) for established characters, and presents CoSER, a collection of authentic character datasets for RPLAs, along with open state-of-the-art models and evaluation protocols using such data.

Abstract

Keywords
Role-Playing Language Models · LLM Persona · Dataset · Evaluation

Reviews and Discussion

Review (Rating: 2)

The paper introduces CoSER, a dataset and framework designed to enhance Role-Playing Language Agents (RPLAs) by simulating established characters using LLMs. CoSER provides a high-quality dataset containing 17,966 characters from 771 books, featuring authentic dialogues, character experiences, and internal thoughts. It also proposes a Given-Circumstance Acting (GCA) method for training and evaluating role-playing LLMs, where models sequentially portray characters in book scenes. The CoSER models, CoSER 8B and CoSER 70B, based on LLaMA-3.1, demonstrate state-of-the-art performance, surpassing or matching GPT-4o in benchmarks for character fidelity and decision-making. The paper highlights CoSER’s impact in training, retrieval, and evaluation of RPLAs, releasing its dataset and models for future research.

Questions for Authors

N/A

Claims and Evidence

  1. It asserts that authentic literary dialogues lead to better role-playing LLMs, but does not present direct ablation studies comparing performance on authentic vs. LLM-generated role-playing data. Demonstrating such an effect with controlled experiments would strengthen the claim.

  2. The evaluation relies heavily on LLM-based judges, particularly GPT-4o, which could introduce bias or overfitting to OpenAI’s model responses rather than proving general improvements across diverse assessment frameworks.

Methods and Evaluation Criteria

  1. The paper introduces Given-Circumstance Acting (GCA) for training and evaluation, but it is not well-validated by human raters. Human assessments of role-playing quality (e.g., expert evaluations from literary scholars or acting professionals) would provide stronger validation than LLM critics.

  2. The penalty-based scoring approach for LLM judges could be subject to overfitting to specific LLM biases rather than capturing true persona alignment. More transparent reporting on how rubric-based penalties are assigned would improve reproducibility.

Theoretical Claims

N/A

Experimental Design and Analysis

N/A

Supplementary Material

I read the supplementary materials and the information is complete.

Relation to Prior Work

Some prior work in digital actors and AI-driven storytelling (e.g., work from interactive fiction and game AI communities) could be more directly compared or contrasted with CoSER’s approach.

Missing Important References

N/A

Other Strengths and Weaknesses

See above.

Other Comments or Suggestions

While the paper presents a good dataset and framework for role-playing language agents, its core contributions—focused on character simulation, narrative coherence, and dialogue modeling—seem more aligned with computational linguistics, NLP, or AI for interactive storytelling rather than core machine learning (ML) innovations. The methodology, including Given-Circumstance Acting (GCA) and LLM-based evaluation, primarily builds on existing LLM architectures rather than introducing fundamental advances in ML theory, optimization, or learning algorithms, which are central to ICML. A stronger alignment with ML advancements, such as novel architectures for long-term character consistency, reinforcement learning for persona fidelity, or interpretability in role-playing agents, would make the work more relevant to the ICML audience. Otherwise, it might be better suited for venues like ACL, EMNLP, or AAAI that emphasize natural language processing and AI applications in interactive settings.

Author Response

Thanks for your valuable feedback. We have responded to all your comments and questions in detail, and will revise the paper to:

  1. Compare authentic vs. LLM-generated role-playing data with experiments.
  2. Experiment with Deepseek V3 and R1 as judges to avoid self-preference bias.
  3. Include human evaluation to validate GCA evaluation.
  4. Compare CoSER with an AI-driven storytelling method.

Please find details below:

Q1: Comparing authentic vs. LLM-generated role-playing data with experiments.

We appreciate this feedback. As suggested, we ran two experiments to demonstrate the superiority of authentic data.

1. GCA Evaluation for Dialogues

We apply GCA evaluation to compare authentic (groundtruth) vs. LLM-generated (GPT-4o) dialogues on 100 samples from CoSER Test, without providing the groundtruth to LLM judges as a reference. We use Deepseek R1 as the judge. Results:

| Dialogues | Score (%) |
| --- | --- |
| Authentic | 85.1 |
| GPT-4o Generated | 76.2 |
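
For concreteness, here is a minimal sketch of this reference-free judging setup, assuming an OpenAI-compatible client for the DeepSeek API; the prompt wording, model id, and `score:` output convention are illustrative assumptions, not our exact protocol.

```python
# Reference-free GCA judging sketch: each dialogue is scored in isolation,
# with no groundtruth shown to the judge.
import re
from openai import OpenAI  # DeepSeek exposes an OpenAI-compatible API

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

JUDGE_PROMPT = (
    "You are evaluating a role-playing dialogue for the scene below.\n"
    "Scene: {scene}\n\nDialogue:\n{dialogue}\n\n"
    "Assess character fidelity, coherence, and human-likeness, "
    "then end your answer with `score: <0-100>`."
)

def judge_dialogue(scene: str, dialogue: str) -> float:
    resp = client.chat.completions.create(
        model="deepseek-reasoner",  # assumed model id for DeepSeek R1
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(scene=scene, dialogue=dialogue)}],
    )
    m = re.search(r"score:\s*([\d.]+)", resp.choices[0].message.content)
    return float(m.group(1)) if m else 0.0

def compare(samples):
    """samples: list of (scene, authentic_dialogue, generated_dialogue)."""
    auth = [judge_dialogue(s, a) for s, a, _ in samples]
    gen = [judge_dialogue(s, g) for s, _, g in samples]
    return sum(auth) / len(auth), sum(gen) / len(gen)
```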

2. Fine-tuning with Dialogues

We fine-tuned Llama-3.1-8B on 200 samples of either authentic or LLM-generated dialogues for 4 epochs, using the same settings as CoSER-8B. We then evaluated both models via GCA on 100 new samples. Results:

| Training Dialogues | Score (%) |
| --- | --- |
| Authentic | 55.3 |
| GPT-4o Generated | 54.7 |

Results show that authentic dialogues surpass LLM-generated data in both quality and training effectiveness. We will include these results in our revised paper.

Q2: The evaluation relies heavily on GPT-4o as LLM judges, which could introduce bias or overfitting to OpenAI's model responses rather than proving general improvements across diverse assessment frameworks.

We acknowledge this feedback. We experimented with Deepseek V3 and R1 as judges to evaluate the performance of 7 models:

(Columns are judges)

| Model | GPT-4o | DS R1 | DS-V3 |
| --- | --- | --- | --- |
| GPT-3.5 | 52.8 | 35.9 | 40.5 |
| Llama-3.1-8B | 51.8 | 37.2 | 36.8 |
| abab7 | 53.7 | 41.5 | 40.4 |
| CoSER-8B | 56.1 | 44.5 | 45.9 |
| CoSER-70B | 57.4 | 50.8 | 47.7 |
| GPT-4o | 58.5 | 48.4 | 46.1 |
| Claude-3.5-Sonnet | 56.2 | 54.8 | 40.7 |

The results validate GPT-4o's bias towards GPT models: R1 and V3 judges prefer CoSER-70B and Claude to GPT-4o. We will include the results in our revised paper.
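
To make the bias pattern explicit, the small script below (with the scores copied verbatim from the table above) prints each judge's top-ranked models:

```python
# Per-judge rankings from the table above; GPT-4o ranks itself first,
# while R1 prefers Claude and V3 prefers CoSER-70B.
scores = {
    "GPT-3.5":           {"GPT-4o": 52.8, "DS R1": 35.9, "DS-V3": 40.5},
    "Llama-3.1-8B":      {"GPT-4o": 51.8, "DS R1": 37.2, "DS-V3": 36.8},
    "abab7":             {"GPT-4o": 53.7, "DS R1": 41.5, "DS-V3": 40.4},
    "CoSER-8B":          {"GPT-4o": 56.1, "DS R1": 44.5, "DS-V3": 45.9},
    "CoSER-70B":         {"GPT-4o": 57.4, "DS R1": 50.8, "DS-V3": 47.7},
    "GPT-4o":            {"GPT-4o": 58.5, "DS R1": 48.4, "DS-V3": 46.1},
    "Claude-3.5-Sonnet": {"GPT-4o": 56.2, "DS R1": 54.8, "DS-V3": 40.7},
}

for judge in ("GPT-4o", "DS R1", "DS-V3"):
    ranking = sorted(scores, key=lambda m: scores[m][judge], reverse=True)
    print(f"{judge} judge ranks: {ranking[:3]}")
```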

Q3: GCA evaluation is not well-validated by human raters. Human assessments of role-playing quality would provide stronger validation than LLM critics.

Thanks for this comment. We have conducted human evaluation on the GCA simulations of 7 models. We find that (1) GCA evaluation aligns well with human judges, and (2) human evaluation of LLM role-playing is highly time-consuming, highlighting the need for automated evaluation. We apologize for omitting the settings due to the space limit. Please refer to our response to Q2 of Reviewer XqGw for details. The results are:

  • Human Evaluation Results
| Model | avg_score | win_rate (%) |
| --- | --- | --- |
| GPT-3.5 | 3.117 | 10.6 |
| Llama-3.1-8B | 3.600 | 19.4 |
| abab7 | 4.533 | 37.5 |
| CoSER-8B | 4.567 | 38.6 |
| CoSER-70B | 6.783 | 86.9 |
| GPT-4o | 4.967 | 47.2 |
| Claude | 6.200 | 73.9 |

The results generally align with GCA evaluation. One difference: human judges show less preference for GPT models than the LLM judge (GPT-4o) does, consistent with Q2.

  • Alignment between LLM and Human Judges

(4o, R1, and V3 refer to GPT-4o and Deepseek R1/V3 judges.)

| Method | Model | Metric (%) |
| --- | --- | --- |
| GCA | 4o | 68.6 |
| GCA | V3 | 65.1 |
| GCA | R1 | 77.5 |
| w/o gt | 4o | 64.3 |
| w/o gt | R1 | 77.2 |
| w/o rb | 4o | 65.1 |
| w/o lc | 4o | 64.5 |
| w/o ds | 4o | 65.2 |
| BLEU | - | 75.3 |
| ROUGE-L | - | 72.0 |

Our results show that GCA evaluation aligns with human judges, with gt, rb, lc, and ds being indispensable components.

Our annotators find that (1) GCA simulation thoroughly reflects LLMs' role-playing abilities; (2) manual evaluation is highly difficult: it requires carefully studying complex background material and lengthy dialogues, and takes 15 minutes on average to evaluate 7 models for 1 case.

Q4: More transparent reporting on how rubric-based penalties are assigned would improve reproducibility.

We appreciate this feedback. We will improve transparency by releasing the LLM-identified flaws and adding detailed examples and analysis to our appendix.
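
As an illustration of what such transparent reporting could look like, here is a minimal sketch of penalty-based score aggregation; the flaw categories, weights, and score floor are illustrative assumptions, not our actual rubric.

```python
# Hypothetical penalty aggregation for rubric-based LLM judging; the flaw
# taxonomy and weights below are assumptions for exposition only.
from dataclasses import dataclass

PENALTY_WEIGHTS = {
    "out_of_character": 10,
    "contradicts_plot": 10,
    "anachronism": 5,
    "repetitive": 3,
}

@dataclass
class Flaw:
    category: str
    severity: float  # judge-assigned severity in [0, 1]

def rubric_score(flaws: list[Flaw], base: float = 100.0) -> float:
    """Start from a perfect score and subtract weighted, severity-scaled penalties."""
    penalty = sum(PENALTY_WEIGHTS.get(f.category, 0) * f.severity for f in flaws)
    return max(0.0, base - penalty)

# Example: two flaws identified by the LLM critic for one simulated dialogue.
print(rubric_score([Flaw("out_of_character", 0.8), Flaw("repetitive", 0.5)]))  # 90.5
```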

Q5: Prior work in digital actors and AI-driven storytelling could be more directly compared or contrasted with CoSER’s approach.

As suggested, we compare GCA simulation with stories generated by HoLLMwood [1]. We use 30 samples from CoSER Test, apply GPT-4o as the actors/writers, and report average scores assessed by GCA evaluation.

| Method | avg_score |
| --- | --- |
| GCA | 59.2 |
| HoLLMwood | 50.2 |

Results show that GCA simulation produces more authentic and human-like character interactions than HoLLMwood. We will include the results in our paper.

Reference:

[1] Jing, C., et al. HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing

Review (Rating: 4)

The paper introduces CoSER, a framework and accompanying dataset for training and evaluating large language models (LLMs) to role-play fictional literary characters. To this end, the authors create a dataset derived from authentic dialogues and character contexts across 771 books and ~18,000 characters, along with a "Given-Circumstance Acting" (GCA) training method emphasizing accurate character portrayal. CoSER is evaluated using multi-agent simulations and penalty-based LLM critic scoring, assessing character fidelity, coherence, and alignment. Furthermore, the authors fine-tune two LLaMA-3.1-based models (CoSER-8B, CoSER-70B) and show that they outperform various baseline models on role-playing benchmarks.

Questions for Authors

What limitations exist, and what potential improvements could be made, regarding the chunk-based data extraction process? Where did the CoSER approach work best, and where did it work less well?

Claims and Evidence

The authors release a high-quality dataset that uses literary texts to ground the synthetic generation; in contrast to purely synthetic datasets, the evaluation shows that the dataset quality is superior. However, work on persona generation, especially grounding synthetic persona generation in existing text, has been carried out before [1]. Accordingly, while the CoSER approach goes beyond [1] and focuses more on multi-turn and fine-grained interactions, this potentially hampers novelty, and I find it important to compare against [1] to elaborate on the differences.

GCA training using multi-agent simulation dialogues is evaluated via LLM critics and N-gram overlap metrics (BLEU, ROUGE-L). I find this convincing to highlight the high quality.

Evaluations against baseline LLMs demonstrate improvements in the CoSER models' multi-turn and character-fidelity performance. The evidence is generally convincing, though the reliance on LLM-based critics introduces potential biases, partially mitigated by supplementary metrics.

[1] Chan, X., Wang, X., Yu, D., Mi, H., & Yu, D. (2024). Scaling Synthetic Data Creation with 1,000,000,000 Personas (arXiv:2406.20094). arXiv. http://arxiv.org/abs/2406.20094

Methods and Evaluation Criteria

The methods and evaluation criteria are appropriate and well-motivated. The systematic extraction of authentic dialogue data, used with GCA, gives a fine-tuned LLM an in-depth understanding of the characters.

Multi-agent simulations offer more realistic role-play evaluation, and penalty-based LLM critics supplemented by traditional metrics help address evaluator bias and make the evaluation more robust.

Theoretical Claims

There are no direct theoretical claims to be proven. The motivation for role-playing datasets to improve language models is quantitatively evaluated and grounded in existing literature.

Experimental Design and Analysis

I find the experimental design sound and clearly outlined:

The fine-tuning strategy has transparent hyperparameters and data splits, with robust comparisons against multiple open-source and proprietary LLMs. A novel multi-agent simulation aims at robust evaluation with thoughtful handling of dialogue-length bias. Potential concerns about the reliability of LLM critics are partially acknowledged by the authors.

Supplementary Material

The supplementary materials include data extraction prompts, training schemas and templates, and LLM critic details. I found that they support the claims of the paper.

Relation to Prior Work

I find that generally the related work is covered well. Related work includes using LLMs as (role-playing) agents, and agent simulations, as well as using LLMs as judges.

Missing Important References

I am missing some related work on persona-based (dataset) generation and persona-based evaluation, e.g., [1], [2].

[2] Liu, A., Diab, M., & Fried, D. (2024). Evaluating Large Language Model Biases in Persona-Steered Generation (arXiv:2405.20253). arXiv. http://arxiv.org/abs/2405.20253

Other Strengths and Weaknesses

Strengths: I find the paper well-written and well-presented and welcome the model and dataset releases.

Weaknesses: A human evaluation would support the high quality beyond LLM critics.

Other Comments or Suggestions

It would be interesting to explore dataset expansion by human feedback.

Author Response

We sincerely appreciate your thorough review and valuable comments. We have responded to them in detail, and will revise the paper to:

  1. Add more related work on persona-based LLMs.

  2. Include human evaluation to further validate our models and evaluation methods.

Please find the details below:

Q1: Difference compared with Persona Hub [1]

CoSER differs from [1] in:

  1. Goal and Focus:

|  | CoSER | Persona Hub |
| --- | --- | --- |
| Target | Simulate established personas | Synthesize instruction data, knowledge distillation |
| Persona Focus | Depth and richness of character data | Breadth of persona types |
| Key Abilities | Anthropomorphism, character fidelity, multi-character interaction | Instruction following |

  2. Dataset Quality:

|  | CoSER | Persona Hub |
| --- | --- | --- |
| Source | Book dialogues | GPT-4o synthesis |
| Persona | Book characters | Synthesized personas |
| Persona Data | Profile (long), dialogue, experience, ... | Profile (short) |
| Prioritizes | Quality and authenticity | Quantity and diversity |

We'll add detailed discussion in our paper.

Q2: A human evaluation would support the high quality beyond LLM critics.

Thanks for your suggestion. As suggested, we conducted human evaluation on GCA simulation of 7 models. Results confirm that (1) CoSER models are preferred by human judges; (2) GCA evaluation aligns well with human judges.

  • Settings

    • Models: CoSER-8B/70B, abab7, Llama-3.1-8B, GPT-3.5/4o, and Claude (3.5-Sonnet).
    • Data: 60 samples from CoSER Test.
    • Annotations: 3 annotators scored the models (1-10 scale) across 20 samples, given background context, groundtruth dialogues, and scoring rubrics.
    • Metrics: models' average scores and win rates vs. other models (a minimal sketch of these computations follows the results below).
  • Results

| Model | avg_score | win_rate (%) |
| --- | --- | --- |
| GPT-3.5 | 3.117 | 10.6 |
| Llama-3.1-8B | 3.600 | 19.4 |
| abab7 | 4.533 | 37.5 |
| CoSER-8B | 4.567 | 38.6 |
| CoSER-70B | 6.783 | 86.9 |
| GPT-4o | 4.967 | 47.2 |
| Claude | 6.200 | 73.9 |

The results generally align with GCA evaluation, confirming the CoSER models' superior performance. Annotators noted that GCA simulation better reflects LLMs' role-playing abilities than previous methods. There is one difference: human judges show less preference for GPT models compared to the LLM judge (GPT-4o). While LLM judges prefer GPT-4o to Claude and GPT-3.5 to Llama-3.1-8B, the results are the opposite for human judges (and also for R1 as judge; see our response to Q2 of Reviewer vmSk), which likely stems from "self-preference" bias in LLM judges [5].
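
For clarity, here is a minimal sketch of how avg_score and win_rate can be computed from the raw annotator scores; counting ties as half a win is an assumption, since the exact convention is not specified here.

```python
# avg_score and pairwise win_rate from human annotations; scores[model] is a
# list of per-sample scores aligned across models. Tie handling is assumed.
from itertools import combinations

def avg_scores(scores: dict[str, list[float]]) -> dict[str, float]:
    return {m: sum(v) / len(v) for m, v in scores.items()}

def win_rates(scores: dict[str, list[float]]) -> dict[str, float]:
    wins = {m: [] for m in scores}
    for a, b in combinations(scores, 2):
        for sa, sb in zip(scores[a], scores[b]):
            outcome = 1.0 if sa > sb else 0.0 if sa < sb else 0.5
            wins[a].append(outcome)
            wins[b].append(1.0 - outcome)
    return {m: 100 * sum(v) / len(v) for m, v in wins.items()}

demo = {"CoSER-70B": [7, 6, 7], "GPT-4o": [5, 6, 4], "GPT-3.5": [3, 2, 4]}
print(avg_scores(demo))
print(win_rates(demo))  # CoSER-70B wins most pairwise comparisons
```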

We further study alignment between LLM judges and human judges using the annotations.

  • Settings:

    • Judge models: GPT-4o (4o), Deepseek V3 (V3) and R1 (R1).
    • Judge methods: Standard GCA evaluation, and ablation variants removing: (i) groundtruth as reference (gt), (ii) rubrics (rb), (iii) length correction (lc), and (iv) dimension separation (ds). Plus BLEU and ROUGE-L.
    • Metrics: We measure how often LLM judges agree with humans when comparing two models, removing model pairs where judges assign similar scores to both models (a minimal sketch of this metric follows the results below).
  • Results

| Method | Model | Metric (%) |
| --- | --- | --- |
| GCA | 4o | 68.6 |
| GCA | V3 | 65.1 |
| GCA | R1 | 77.5 |
| w/o gt | 4o | 64.3 |
| w/o gt | R1 | 77.2 |
| w/o rb | 4o | 65.1 |
| w/o lc | 4o | 64.5 |
| w/o ds | 4o | 65.2 |
| BLEU | - | 75.3 |
| ROUGE-L | - | 72.0 |

Our results show that: (1) gt, rb, lc, and ds improve LLM judges in GCA evaluation; (2) Reasoning models excel as judges - Deepseek-R1 (77.5%) surpasses V3 (65.1%) and 4o (68.6%); (3) BLEU and ROUGE-L remain effective.
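
A minimal sketch of the pairwise agreement metric described in the settings above, as we understand it; the near-tie threshold `tie_eps` is an assumed parameter.

```python
# Agreement between an LLM judge and human judges on pairwise model
# comparisons; pairs where the judge's scores fall within tie_eps of each
# other are dropped, per the settings above.
from itertools import combinations

def pairwise_agreement(human: dict[str, float], judge: dict[str, float],
                       tie_eps: float = 1.0) -> float:
    agree, total = 0, 0
    for a, b in combinations(human, 2):
        if abs(judge[a] - judge[b]) < tie_eps:
            continue  # judge assigns similar scores; skip this pair
        total += 1
        if (human[a] - human[b]) * (judge[a] - judge[b]) > 0:
            agree += 1
    return 100 * agree / total if total else 0.0

human = {"CoSER-70B": 6.783, "GPT-4o": 4.967, "GPT-3.5": 3.117}
r1 = {"CoSER-70B": 50.8, "GPT-4o": 48.4, "GPT-3.5": 35.9}
print(pairwise_agreement(human, r1))  # 100.0: same ordering on all pairs
```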

Q3: Missing related work on persona-based LLMs, e.g., [1], [2].

Thanks for this feedback. We'll include more related work on persona-based LLMs in our paper, including [1-4].

Q4: Explore dataset expansion by human feedback.

Yes. Our data pipeline has been iteratively improved with human feedback, and we'll continue to explore dataset expansion.

Q5: Limitations and potential improvements for chunk-based data extraction.

The key limitations are: (1) plot fragmentation, i.e., splitting a plot across different chunks, and (2) limited context, without long-term knowledge of the book. Potential improvements include: (1) iterative, plot-aware chunking (implemented in CoSER) and (2) long-term memory for the LLM extractor.
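
To make these points concrete, here is a minimal sketch of iterative, plot-aware chunking; `ends_mid_plot` stands in for a hypothetical LLM call (replaced here by a trivial heuristic), and all sizes are illustrative.

```python
def ends_mid_plot(chunk: str) -> bool:
    """Placeholder for a hypothetical LLM check of whether a chunk ends in the
    middle of an ongoing scene; here, any chunk not ending at a line break is
    treated as mid-scene."""
    return not chunk.endswith("\n")

def plot_aware_chunks(book: str, target: int = 4000,
                      max_extend: int = 2000, step: int = 500) -> list[str]:
    """Iteratively extend each chunk until the current scene closes (or a cap
    is hit), reducing plot fragmentation across chunk boundaries."""
    chunks, start = [], 0
    while start < len(book):
        end = min(start + target, len(book))
        while (end < len(book) and end - start < target + max_extend
               and ends_mid_plot(book[start:end])):
            end = min(end + step, len(book))
        chunks.append(book[start:end])
        start = end
    return chunks
```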

Q6: Where did CoSER approach work best, and where didn't it work so well?

CoSER excels through its high-quality data covering rich data types and its nuanced evaluation. However, its extraction recall remains suboptimal: it does not capture all character knowledge. We will address this in future work.

Reference:

[3] Deshpande, A., et al. Toxicity in chatgpt: Analyzing persona-assigned language models.

[4] Park, J. S., et al. Generative agents: Interactive simulacra of human behavior.

[5] Wataoka, K., et al. Self-preference bias in llm-as-a-judge.

Review (Rating: 3)

This paper introduces CoSER, a large-scale role-playing dataset extracted from 771 renowned books featuring 18k characters. The authors build CoSER with the goal of leveraging it as a high-quality resource for given-circumstance acting in large language models. Two fine-tuned Llama-based models are also introduced, achieving state-of-the-art performance on role-playing benchmarks, comparable to proprietary models.

Questions for Authors

See weaknesses.

Claims and Evidence

Yes, most of the claims are supported by good evidence. The CoSER dataset is well justified by the technical details provided and the high-quality data from renowned books. The evaluations are good.

Methods and Evaluation Criteria

Yes, they do make sense. The methods are well-suited for evaluating role-playing language agents, focusing on multi-character simulation and authentic character portrayal. Most of the evaluations use methods established in the community.

Theoretical Claims

N/A. No theoretical claims made in this paper.

Experimental Design and Analysis

Yes, they sound good. Again, most of the evaluations use methods established in the community and are not part of the contribution of this work.

Supplementary Material

Yes, I have read the supplementary material. It is good, and I think I can find sufficient details for reproducing the method and main experiments.

Relation to Prior Work

This paper relates to the chatbots/HCI community. It is more of an application-oriented paper.

Missing Important References

Yes, I think related works are well discussed.

Other Strengths and Weaknesses

Strengths:

  • This paper is well-motivated, focusing on the problem of scaling up role-playing with carefully crafted datasets. The authors collected role-playing data for thousands of characters from renowned books; I appreciate the effort and the data collection pipeline here. The two Llama-based role-playing LLMs are also potentially useful for practical applications.

  • The proposed dataset shows good performance when used for developing role-playing agents; the CoSER-70B model achieved state-of-the-art performance, surpassing proprietary GPT models.

Weaknesses:

  • One weakness is the generalizability of the findings. The work focuses on the strongly application-oriented problem of role-playing in NLP. I wonder what the broader implications of the proposed dataset and model are, and what is fundamentally different about CoSER compared with previous role-playing works (there is also room for improved presentation here).

  • Copyright concerns: although the authors discussed this in supplementary G, I am still not sure whether releasing processed data would bypass the copyright issue. One way of improving this might be to release the full processing code/pipeline and let users process their own book data.

Other Comments or Suggestions

N/A.

Author Response

We thank the reviewer for the constructive comments and advice. We have responded to your comments and questions in detail. We hope our responses will help you better understand the details of our work and findings. Please find the details below:

Q1: Generalizability and Broader implications of our findings, datasets and models.

We believe our work has broader implications in three key areas: (1) alignment and human-like AI, (2) complex instruction following and reasoning, (3) evaluation method for complex open-ended tasks.

  1. Alignment and human-like AI

Generally, role-playing represents a form of anthropomorphism and alignment. It is closely related to several key aspects of human-like AI, including human-like speaking styles, social intelligence, and emotional intelligence. Our work brings valuable insights for improving and evaluating these abilities. Specifically, models trained on our dataset demonstrate a speaking and thinking style that is better aligned with humans (as shown in Tables 7 and 10).

  2. Complex instruction following and reasoning

Role-playing established characters is a highly complex task that requires models not only to behave like humans (as discussed above) but also to simulate specific characters, which requires in-depth understanding of and adherence to extensive background knowledge. This challenges a model's ability to follow complex instructions and constraints and to understand knowledge in long contexts. Furthermore, decision-making in role-playing represents a sophisticated reasoning task, where LLMs need to consider various constraints in personas (such as their needs, personality traits, and social relationships).

  3. Evaluation method for complex open-ended tasks

Our findings on the evaluation of role-playing LLMs can generalize to evaluation methodologies for broad, complex, open-ended tasks. LLM performance on mathematics and coding tasks can be effectively assessed through verifiable answers; however, nuanced evaluation of subjective, open-ended tasks such as creative writing remains a challenge, significantly limiting the development of LLMs in these areas.

Many findings from our work have broader implications for LLM-based evaluation of complex, open-ended tasks, including: (1) the importance of reference answers for LLM judges; (2) providing detailed rubrics to guide LLM judges; (3) evaluating different dimensions separately to avoid bias; and (4) reasoning models' superior performance as LLM judges.

Q2: Fundamental difference between CoSER and previous role-playing works.

Our work differs from previous role-playing works primarily in dataset and evaluation.

  • For the dataset, as shown in Table 1 and Figure 1, CoSER differs from previous role-playing datasets in two aspects: (1) CoSER extracts authentic, multi-turn, multi-character dialogues from books. In contrast, previous datasets are primarily synthesized by LLMs with little grounding in real plots, and typically focus on simplified scenarios between two characters or between one character and one user; (2) CoSER provides comprehensive data types beyond profiles and dialogues, including characters' actions and thoughts in messages and their experiences in key events, which are ignored in previous datasets.

  • For evaluation, CoSER differs from previous evaluation methods in that: (1) CoSER thoroughly elicits LLMs' role-playing performance via multi-turn, multi-character simulation, while previous evaluations are typically based on single-turn, single-character responses; hence, CoSER provides LLM judges with comprehensive performance results for evaluation (a minimal sketch of such a simulation loop is given below); (2) CoSER provides groundtruth conversations as references for LLM judges, while previous evaluations either lack references or use GPT-synthesized dialogues as references; (3) CoSER provides detailed rubrics written by experts to guide LLM judges, while previous evaluations simply ask LLM judges to give a score; (4) CoSER explores and mitigates bias in LLM judges, such as length bias and dimension-correlation bias.
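
To illustrate the multi-turn, multi-character setup referenced above, here is a minimal sketch of a GCA-style simulation loop; the `generate_turn` wrapper, prompt format, and round-robin speaking order are assumptions for exposition, not the paper's exact implementation.

```python
# Multi-agent GCA simulation sketch: one agent per character, all sharing a
# transcript within a given scene. Round-robin turn order is an assumption.
from dataclasses import dataclass, field

def generate_turn(prompt: str) -> str:
    """Hypothetical LLM call; replace with any chat-completion client."""
    return "(model output placeholder)"

@dataclass
class CharacterAgent:
    name: str
    profile: str
    history: list[str] = field(default_factory=list)

    def act(self, scene: str) -> str:
        prompt = (
            f"You are {self.name}. Profile: {self.profile}\n"
            f"Scene: {scene}\nConversation so far:\n" + "\n".join(self.history)
            + f"\nRespond as {self.name}, including actions and inner thoughts."
        )
        return generate_turn(prompt)

def simulate(scene: str, agents: list[CharacterAgent], n_rounds: int = 3) -> list[str]:
    transcript: list[str] = []
    for _ in range(n_rounds):
        for agent in agents:
            utterance = f"{agent.name}: {agent.act(scene)}"
            transcript.append(utterance)
            for a in agents:  # every agent sees the shared transcript
                a.history.append(utterance)
    return transcript
```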

Q3: Copyright concerns: although the authors discussed this in supplementary G, I am still not sure whether releasing processed data would bypass the copyright issue. One way of improving this might be to release the full processing code/pipeline and let users process their own book data.

Thanks for your feedback and suggestion. Following your suggestion, we will seek to protect copyrights and release the full processing code to the public.

Final Decision

Interesting idea and good execution. Multiple reviewers agree that this work is well-motivated, scales to a large corpus, and includes an appropriate evaluation.

One common concern from many reviewers is the related work. Its coverage seems thin relative to the currently emerging and expanding literature on this and related topics. The AC highly recommends comparing thoroughly and offering deep take-home insights. A table, like the one in your rebuttal, would be a nice way to convey the core differences. The AC also wants to point to TimeChara [a], which proposed role-playing LLMs, although it addresses agent-user interaction focusing on point-in-time character hallucination.

The AC sees that the other raised weaknesses were addressed during the rebuttal period, leaving no major concerns among reviewers.

[a] Ahn et al. 2024. TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024.