Thinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning
Abstract
Reviews and Discussion
This paper proposes a Role-Aware Reasoning (RAR) method to enhance the role-play capability of LLMs. RAR comprises two parts: Role Identity Activation (RIA) and Reasoning Style Optimization (RSO). In RIA, a large reasoning model is asked to generate responses based on the instruction and some key points for better role-playing, and the model is then trained on this data. In RSO, the LRM first generates responses under two scenarios (logical analysis and vivid interaction) and two system prompts (focusing on facts or on character knowledge). After training, RAR surpasses baselines on SocialBench and CharacterBench, alleviating the attention diversion and style drift problems.
Strengths and Weaknesses
Strengths
- This paper has a clear structure and is well-written.
- The paper cleverly uses the response preferences of role-play agents in different scenarios to construct positive and negative sample pairs for contrastive learning, which effectively improves the quality of role-play.
- The evaluation benchmarks and metrics in this paper are comprehensive, including memory consistency, fact accuracy, etc.
Weaknesses
- There are doubts about the generalizability of RAR. The components of RIA and the scenarios and preferences in RSO are all based on human heuristics. Are they general? For example, if the character being evaluated needs to perform logical analysis based on what they know, will the negative samples in RSO no longer be useful?
- Data quality is not guaranteed. The authors first argue that LRM has problems with attention diversion and style drift, and then use LRM to generate training data. How to ensure that the data generated by LRM fully complies with the instructions? Additionally, there is no relevant evaluation of the quality of the RAR training data.
- The article lacks a formulation of attention diversion and style drift. For attention diversion, why do you add constraints to the prompt and train the model with the distilled data? Is there a comparison with simply rewriting the prompt, to prove the necessity of training and distillation? For style drift, the authors manually delineate only two styles for training, and in practice styles are difficult to classify cleanly. Is the contrastive learning of the two styles general enough? In addition, without distillation (for example, using the model's own responses in RSO), can RAR surpass other methods?
Questions
- What is the LRM model used to create data? Do Distill and RAR use the same LRM?
- What is the number of training samples of the RAR and Distill methods? Does RSO use all the data to construct the comparison pairs?
- What do the horizontal and vertical axes of Figure 3 represent? Are there any quantitative metrics that can show that there is a difference between the two distributions? It seems that the overlap of (c) is also very high. Does the difference between the two distributions in Figure 3 mean that LLM performs well in both scenarios?
- If the content and number of training data are different, is it fair to compare RAR with Neeko and CharacterGLM?
Limitations
yes
Final Rating Justification
The authors address my questions regarding data quality and the generalizability of the method. In addition, they clarify that the comparison with CharacterGLM primarily focuses on data quality and provide experiments to support this. Therefore, I have raised the score to 4 (borderline accept).
Formatting Issues
None
Thank you for the clear and thorough comments.
W-1.1: Are the components of RIA and RSO general?
The components of RAR are grounded in well-established cognitive theories rather than ad-hoc heuristics:
- Cognitive Theory of RIA: The RIA components directly implement the Cognitive-Affective Personality System (CAPS) [1] model, which views personality as a system of cognitive and affective units (e.g., beliefs, goals, feelings) that are activated by situational features. Our modules (Standpoint, Motivation, Emotion, etc.) are designed to map onto these units, providing a psychologically robust and generalizable foundation.
- Cognitive Theory of RSO: Similarly, the 'logical' vs. 'vivid' distinction in RSO is an extension grounded in the widely accepted dual-process theory of cognition (System 1 and System 2) [2]. Our work extends this by identifying that role-playing requires a reasoning process that is neither purely logical nor purely intuitive—Role-Aware Reasoning. RSO enables the model to dynamically navigate between vivid and logical styles, ensuring broad applicability.
In summary, both RIA and RSO are sufficiently general to handle most situations.
W-1.2: Will the negative samples no longer be useful?
RSO's goal is to make the model aware of the stylistic space rather than imposing a strict either-or style choice. The negative samples provide contrastive signals that help the model learn the distinguishing features of each style. At inference time, the model can then produce a proper response that fits the current scenario.
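For illustration, the sketch below shows how such scenario-conditioned positive/negative pairs could be assembled; the field names and the chosen/rejected format (as consumed by DPO-style preference trainers) are our own illustrative assumptions, not necessarily the exact pipeline.

```python
# Illustrative sketch only: assembling scenario-conditioned preference pairs for
# RSO-style contrastive training. Field names are assumptions for this example.

def build_preference_pairs(samples):
    """Each sample holds two teacher reasoning traces for the same role-play prompt:
    one in the scenario-matching style and one in the opposite style."""
    pairs = []
    for s in samples:
        if s["scenario"] == "logical":
            chosen, rejected = s["logical_trace"], s["vivid_trace"]
        else:  # "vivid" interaction scenario
            chosen, rejected = s["vivid_trace"], s["logical_trace"]
        pairs.append({
            "prompt": s["role_prompt"],  # character profile + user turn
            "chosen": chosen,            # reasoning style matches the scenario
            "rejected": rejected,        # mismatched style, used as the contrastive signal
        })
    return pairs
```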
W-2: How to ensure that the data generated by the LRM complies with the instructions?
Our method ensures the distillation of high-quality data from a flawed LRM through the following features:
- Similar Work: Prior work has shown that even imperfect LLMs can produce reliable outputs when guided by carefully designed prompts that scaffold the reasoning process [3,4].
- Methodology Principle: The LRM is an imperfect role-player, but it can produce high-quality role-aware reasoning when guided by our methods. In this process, the RIA and RSO prompts act as scaffolds. Then, distillation transfers this prompt-guided behavior (rather than the LRM's flawed behavior) to the student model.
- Quality Verification: To verify this, we have conducted several rounds of manual evaluation in the experiment. In a final check on 100 sampled training instances, three annotators answered some quality-control questions. The results in Table 1 show that the vast majority of the data conforms to the instructions.
| Questions | Yes% |
|---|---|
| Is the response consistent with the character's style? | 90% |
| ...'s knowledge scope? | 88% |
| Does the reasoning process include the character's standpoint? | 93% |
| ...'s motivation? | 90% |
| ...'s experience? | 89% |
| ...'s emotion? | 87% |
| Does the positive reasoning example match the scenario's style? | 85% |
| ...contrast to the scenario? | 92% |
Table 1: Manual quality assessment.
W-3.1: Lack of formulation for attention diversion and style drift.
We define these two key failure modes:
- Attention Diversion is the model's tendency to ground its response in its general world knowledge or generic conversational patterns, rather than consistently adhering to the character's specific persona (e.g., their unique standpoint, memories, and motivations).
- Style Drift is the tendency for the model's internal thought process to default to a logical and formal style, which is incongruent with the immersive and often emotional context of role-playing, making the final response feel unnatural or out-of-character.
W-3.2: Why is prompt engineering and distillation necessary?
This is done because:
- RAR aims to internalize role constraints so the model can "think in character" and "respond like character", enabling deliberate reasoning [5] in role-playing without relying on complex profile setups.
- Though the profiles used during training are simple, the model performs well on benchmarks with diverse, messy profiles — showing it has learned to reason from the persona, not just follow static traits.
- By prompting a reasoning trace grounded in core elements (e.g., motivation, standpoint), RIA anchors the model's thinking, reducing drift toward generic or out-of-character patterns.
Therefore, adding constraints and distilling are meaningful.
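For illustration, a minimal sketch of this prompt-then-distill idea under assumed helper names: the teacher LRM receives the RIA-scaffolded prompt, while the student is fine-tuned on the plain instruction, so the scaffolded reasoning must be internalized.

```python
# Minimal sketch (assumed pipeline, hypothetical helpers) of prompt-guided distillation.

def build_sft_records(instructions, teacher_generate, ria_scaffold):
    records = []
    for inst in instructions:
        # Scaffold adds the role-aware cues (standpoint, motivation, emotion, ...).
        scaffolded_prompt = ria_scaffold.format(instruction=inst)
        trace_and_response = teacher_generate(scaffolded_prompt)  # prompt-guided teacher output
        records.append({
            "input": inst,                 # student never sees the scaffold
            "output": trace_and_response,  # reasoning trace + in-character reply
        })
    return records
```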
W-3.3: Is rewriting the prompt alone sufficient?
Following your suggestion, we tested a RAR (Prompt-Only) baseline that applies prompts without distillation.
| Method | MC | FA | ES | ER | MS | MR | HL | EG | – | – | – | – | – | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla | 3.28 | 2.04 | 3.61 | 3.64 | 3.28 | 3.21 | 2.98 | 2.72 | 2.43 | 4.37 | 4.59 | 2.56 | 2.74 | 3.19 |
| RAR (Prompt-Only) | 3.61 | 2.30 | 3.75 | 4.20 | 3.87 | 3.57 | 3.51 | 2.75 | 2.79 | 4.40 | 4.61 | 2.68 | 2.78 | 3.45 |
| RAR w/o RSO | 3.87 | 2.26 | 3.81 | 4.30 | 4.06 | 3.84 | 3.39 | 3.15 | 2.89 | 4.80 | 4.69 | 2.76 | 3.01 | 3.60 |
| RAR (Self-RSO) | 3.86 | 2.25 | 3.75 | 4.25 | 3.03 | 3.80 | 3.41 | 2.95 | 2.71 | 4.61 | 4.55 | 2.55 | 2.60 | 3.41 |
| RAR (Full) | 3.99 | 2.54 | 3.85 | 4.23 | 4.20 | 4.06 | 3.93 | 3.13 | 2.79 | 4.82 | 4.76 | 2.78 | 2.93 | 3.69 |
Table 2: Ablation Study on the Necessity of Distillation.
It yields only marginal improvements over Vanilla and is clearly inferior to the full RAR, especially in the knowledge and persona dimensions (Table 2, upper part).
Therefore, simply rewriting the prompt is insufficient for instruction-tuned base models (like Llama-3-8B-Instruct), which are not trained to produce such thought processes.
W-3.4: Are the two styles general enough?
The two styles are sufficient because:
- Relevant Theory: The "logical vs. vivid" distinction is deliberate and grounded in dual-process theory of human cognition (intuitive System 1 and logical System 2) [2]. This binary framework has been widely adopted in recent work [6,7].
- Theoretical Implementation: Moreover, we observe that role-playing requires a third mode: reasoning that is vivid and imaginative — which we define as Role-Aware Reasoning. RSO enables the model to dynamically switch between vivid and logical reasoning styles based on scenario demands.
In summary, this style control is both general and extensible.
W-3.5: What if distillation is not used in RSO?
As shown in the lower part of Table 2, following your suggestion, we tested Self-RSO, where the student model generated its own contrastive pairs after RIA. The results are worse than full RAR, and even worse than RAR w/o RSO on some metrics (e.g., HL, EG).
This highlights that teacher-generated, high-quality preference pairs are crucial. Contrastive learning with low-quality pairs degrades performance.
Q-1: What is the LRM model used to create data? Do Distill and RAR use the same LRM?
For fairness, both the Distill and RAR methods used QwQ-32B as the teacher LRM. We will move this information from the appendix to the main section.
Q-2: What is the number of training samples of the RAR and Distill methods? Does RSO use all the data to construct the comparison pairs?
As stated on line 160, the Vanilla, Distill, and RIA stages all used the full RoleBench-Train dataset (137,920 samples). For RSO, using the full dataset caused training instability. We therefore subsampled 5,000 diverse instances (2,500 per scenario). Larger datasets led to unstable training — contrastive accuracy increased, but the loss showed erratic fluctuations (cannot upload the figure).
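For illustration, a minimal sketch of the balanced subsampling, assuming a simple random draw of 2,500 instances per scenario (the actual diversity criterion may differ):

```python
# Sketch of balanced subsampling per scenario (assumed procedure).
import random

def subsample_balanced(data, per_scenario=2500, seed=0):
    rng = random.Random(seed)
    subset = []
    for scenario in ("logical", "vivid"):
        pool = [d for d in data if d["scenario"] == scenario]
        subset.extend(rng.sample(pool, per_scenario))
    rng.shuffle(subset)
    return subset
```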
Q-3.1: What do the horizontal and vertical axes represent?
The axes have no intrinsic meaning; they are the two dimensions from t-SNE, a standard method for visualizing high-dimensional data. We applied t-SNE to final-layer hidden states from prompts labeled "logical" (blue) and "vivid" (red) to visualize whether the model internally separates these scenarios.
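For reference, a minimal sketch of this visualization procedure, assuming one pooled final-layer hidden-state vector per prompt has already been extracted:

```python
# Sketch: t-SNE projection of pooled final-layer hidden states, colored by scenario.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_scenario_separation(hidden_states, labels):
    """hidden_states: array of shape (n_prompts, hidden_dim); labels: 'logical' or 'vivid'."""
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(hidden_states)
    for name, color in (("logical", "blue"), ("vivid", "red")):
        mask = np.array([lab == name for lab in labels])
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, c=color, label=name)
    plt.legend()
    plt.show()
```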
Q-3.2: Are there any quantitative metrics to show the difference?
Following your suggestion, we report the Silhouette Score for the hidden state clusters in each model. The Silhouette Score measures how well-separated clusters are, with a higher score (closer to 1) indicating more distinct clusters. The RAR model achieves a significantly higher Silhouette Score in Table 3, providing concrete, quantitative proof that our RSO module teaches the model to internally differentiate between the two reasoning scenarios, better than the baseline models.
| Method | Silhouette Score |
|---|---|
| (a)Vanilla | 0.06 |
| (b)Distill | 0.13 |
| (c)RAR | 0.32 |
Table 3: Silhouette Scores of the hidden-state clusters for each model.
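For reference, a minimal sketch of how these scores can be computed with scikit-learn, assuming the same pooled hidden states and scenario labels used for the t-SNE plot:

```python
# Sketch: silhouette score of the "logical" vs. "vivid" hidden-state clusters.
from sklearn.metrics import silhouette_score

def scenario_silhouette(hidden_states, labels):
    # Closer to 1 means the two scenario clusters are better separated.
    return silhouette_score(hidden_states, labels, metric="euclidean")
```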
Q-3.3: Does the difference mean the LLM performs well?
t-SNE is a visualization technique that projects high-dimensional data into a low-dimensional space while preserving local structure. Therefore, even with some overlap, RAR's clusters are clearly more distinct than those of Vanilla or Distill, indicating better scenario-aware representation.
In summary, this figure demonstrates that the LLM performs well in distinguishing between different scenarios.
Q-4: Is it fair to compare RAR with Neeko and CharacterGLM?
We compare with them for the following reasons:
- Neeko: Neeko is a model architecture specifically designed for role-playing. We re-implemented the Neeko method on our base model (LLaMA-3-8B) and trained it on the exact same dataset (RoleBench-Train).
- CharacterGLM: CharacterGLM is an open-source model trained on its own large-scale proprietary dataset. The quality and scale of its training data are significantly higher than ours, so comparing our method with CharacterGLM is both fair and aligned with standard practice in the field.
References
[1] A cognitive-affective system theory of personality: Reconceptualizing situations, dispositions, dynamics, and invariance in personality structure
[2] Thinking, Fast and Slow.
[3] Meta‑Prompting: Enhancing Language Models with Task‑Agnostic Scaffolding
[4] Concept-Based Rubrics Improve LLM Formative Assessment and Data Synthesis
[5] Deliberative Alignment: Reasoning Enables Safer Language Models
[6] Distilling system 2 into system 1
[7] From system 1 to system 2: A survey of reasoning large language models
Thanks for the responses. Some of my questions have been resolved, but two points remain: (1) In the answer to Q-2, why does the full dataset cause unstable training? Is it because the positive and negative sample pairs in the dataset contain noise? (2) I still think CharacterGLM is an inappropriate baseline. Because the training data and the base model are different, it is difficult to claim that the superiority of your method is due to the method design rather than other factors (model capability, similarity of training and test set distributions).
Dear Reviewer JQVU,
Thank you for your response and for raising two key questions, which help us further clarify the contributions and details of our work.
- On the instability of RSO training with the full dataset
We would like to clarify that the observed instability does not stem from “noise” or low-quality data. As shown in our previous response (Table 1), human evaluation confirms that our distilled data is of high quality and follows instructions well. The instability mainly comes from the nature of large-scale contrastive learning on fine-grained tasks:
- Gradient conflicts: RSO requires the model not only to learn both styles but also to distinguish and switch between them in context. This can cause subtle gradient conflicts, pulling the model in slightly different directions and leading to oscillating losses and unstable convergence.
- Necessity of subsampling: The goal of RSO is to build robust internal representations for the two “prototype” reasoning styles. A moderately sized subset provides cleaner and more effective training signals. In contrast, using the full dataset can pull the model too far from its SFT-initialized foundation, leading to training collapse.
To verify robustness, we retrained RSO with two new 5k subsets using different random seeds, with results as follows:
| Run | MC | FA | ES | ER | MS | MR | HL | EG | – | – | – | – | – | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RSO-Ori | 3.99 | 2.54 | 3.85 | 4.23 | 4.20 | 4.06 | 3.93 | 3.13 | 2.79 | 4.82 | 4.76 | 2.78 | 2.93 | 3.69 |
| RSO-1 | 3.971 | 2.512 | 3.860 | 4.259 | 4.221 | 4.038 | 3.942 | 3.101 | 2.806 | 4.829 | 4.750 | 2.791 | 2.959 | 3.695 |
| RSO-2 | 3.997 | 2.538 | 3.827 | 4.227 | 4.213 | 4.064 | 3.930 | 3.144 | 2.772 | 4.812 | 4.737 | 2.736 | 2.917 | 3.686 |
The experimental results show that RSO is robust and minimally affected by potential noise.
- On the fairness of comparison with CharacterGLM
This is an excellent question. Our evaluation has two complementary purposes:
- Method validation: The core scientific claim is tested with controlled baselines (Vanilla, Distill, Neeko) using the same base model (Llama-3-8B) and the same dataset (RoleBench). These comparisons fairly demonstrate the intrinsic effectiveness of RAR.
- Performance benchmarking: Role-playing is highly application-oriented, and most prior work focuses on new datasets rather than new methods. Comparing with CharacterGLM positions our final model in the broader landscape of public role-playing models. CharacterGLM is trained on a large private dataset, while our method—trained on public synthetic data—achieves comparable or better performance, showing both its effectiveness and practical value.
In short, controlled baselines (e.g., Neeko) explain why our method works, while CharacterGLM shows how well it performs in realistic settings.
To verify this, following your suggestion, we added more comparisons:
First, as noted above, most works in role-playing focus on proposing new datasets. Therefore, we selected the following datasets for comparison on Llama-3-8B:
| Dataset | MC | FA | ES | ER | MS | MR | HL | EG | – | – | – | – | – | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RoleBench-Train | 3.28 | 2.04 | 3.61 | 3.64 | 3.28 | 3.21 | 2.98 | 2.72 | 2.43 | 4.37 | 4.59 | 2.56 | 2.74 | 3.19 |
| COSER [1] | 3.69 | 2.57 | 3.92 | 3.7 | 4.03 | 2.99 | 3.54 | 2.74 | 2.64 | 5.69 | 5.58 | 1.67 | 1.97 | 3.44 |
| Haruhi54K [2] | 3.84 | 2.61 | 4.03 | 3.55 | 3.88 | 2.59 | 3.13 | 2.59 | 2.56 | 5.04 | 5.01 | 1.21 | 1.38 | 3.19 |
| Character-LLM [3] | 3.98 | 2.72 | 4.01 | 3.80 | 3.87 | 2.78 | 3.21 | 2.78 | 2.68 | 5.02 | 5.02 | 1.75 | 1.90 | 3.35 |
| Ours | 3.99 | 2.54 | 3.85 | 4.23 | 4.20 | 4.06 | 3.93 | 3.13 | 2.79 | 4.82 | 4.76 | 2.78 | 2.93 | 3.69 |
| Dataset | #Samples | Avg. tokens |
|---|---|---|
| RoleBench-Train | 137,920 | 64.60 |
| COSER | 305,134 | 410.73 |
| Haruhi54K | 62,663 | 741.24 |
| Character-LLM | 13,932 | 559.97 |
| Ours | 137,920 | 907.60 |
RoleBench-Train itself is a synthetic dataset. Our dataset is also synthetic and built on RoleBench-Train. Compared with high-quality datasets such as ChatHaruhi, Character-LLM, and COSER, our method still performs very well.
Second, we added experiments using Qwen3-14B as the base model (10k samples due to time limits):
| Method | MC | FA | ES | ER | MS | MR | HL | EG | – | – | – | – | – | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla-Qwen | 3.47 | 2.16 | 3.21 | 3.48 | 3.46 | 2.73 | 3.03 | 2.38 | 2.75 | 4.16 | 4.19 | 1.98 | 2.51 | 3.04 |
| RIA-Qwen | 3.77 | 2.56 | 3.91 | 3.91 | 3.91 | 3.07 | 3.30 | 2.99 | 2.94 | 4.98 | 4.95 | 1.84 | 2.10 | 3.40 |
These results confirm the superiority of our method and its cross-model generalization, while as a data augmentation approach, it also produces synthetic data of higher quality than existing SOTA methods.
Thank you again for your valuable feedback.
References
[1] CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
[2] ChatHaruhi: Reviving Anime Character in Reality via Large Language Model
[3] Character-LLM: A Trainable Agent for Role-Playing
Thanks for the response and additional experiments. I will raise my score to 4.
Dear Reviewer JQVU,
We sincerely thank you for your timely response and your support in raising the rating score! These discussions have greatly strengthened the rigor and completeness of our work. If there are any remaining concerns or areas for improvement, we would greatly appreciate it if you could point them out, and we will make every effort to address them thoroughly.
This paper introduces a novel method, Role-Aware Reasoning (RAR), to improve the performance of Role-Playing Agents (RPAs). The authors identify two key failure modes in existing systems: "attention diversion" (the model forgets its role) and "style drift" (the model's reasoning is overly formal). The proposed RAR method consists of two main stages: Role Identity Activation (RIA), which uses character profiles to maintain focus, and Reasoning Style Optimization (RSO), which uses distillation to align the model's reasoning style with the character and context. The primary contributions are the RAR method itself, the specific techniques of RIA and RSO to counteract attention diversion and style drift, and extensive experiments demonstrating that this approach significantly improves role-playing performance on established benchmarks.
Strengths and Weaknesses
Strengths
- Clear Problem Formulation and Novel Solution: The paper clearly identifies a significant gap in existing research—the lack of deep, character-consistent thought processes in RPAs—and proposes a novel and intuitive solution (RAR) to address it. The conceptual separation of role identity (RIA) and reasoning style (RSO) is a strong architectural choice.
- Robust Experimental Design and Baselines: The authors conduct a thorough evaluation using a strong and well-selected set of baseline models. The baselines cover foundational methods (Vanilla), alternative approaches (RAG), state-of-the-art competitors (Neeko, Character-GLM), and crucial ablations (Distill, Thinking Modes) that effectively isolate the impact of the paper's core contributions. This rigorous comparison strongly supports the claims of RAR's effectiveness.
- Strong Performance on Established Benchmarks: The paper's primary claims are supported by strong empirical results on independent, public benchmarks (CharacterBench and SocialBench). As these benchmarks rely on more objective measures (e.g., multiple-choice questions in SocialBench) or their own validated judge models (CharacterBench), the results presented in Tables 1 and 2 are methodologically sound and demonstrate a significant performance improvement.
Weaknesses
- Methodological Flaw in Reasoning Trace Evaluation: The "Reason Trace Evaluation" (Section 4.4, Table 4) is methodologically weak. The authors use a GPT-4 auto-rater but do not provide any validation for it. There is no reported correlation with human judgments, which is the standard for such an evaluation, nor is there mention of manual inspection of the auto-rater's output. This lack of rigor makes the claims about the qualitative superiority of RAR's reasoning traces (e.g., coherence, relevance) unsupported and unreliable.
- Inconsistent Analysis of Ablation Study Results: The textual analysis of the ablation study (Section 4.3) oversimplifies and misrepresents the data in Table 3. The paper claims RIA is the key driver for all persona consistency, but the data shows a more nuanced picture where the RSO-only model (RAR w/o RIA) actually outperforms the RIA-only model (RAR w/o RSO) on human-side consistency metrics. This inconsistency between the text and data points to a flaw in the authors' analysis of their own results.
- Presentation and Reporting Issues:
  - The captions for Table 1 and Table 2 have been swapped, which is a significant typographical error that could confuse readers.
  - The paper does not explain how the metrics for the evaluation benchmarks (CharacterBench, SocialBench) are calculated, instead relying on readers to consult the original papers. While common, this makes the paper less self-contained.
- Missed Context in Related Work: While the related work section is adequate, it could have been more comprehensive. It misses some recent and conceptually similar work in areas like cognitive architectures (e.g., MIRROR) and advanced persona control (e.g., PCL, Activation Engineering). Acknowledging these would have better positioned the work at the cutting edge of the field.
Questions
Thank you for your work on this interesting and important problem. Your proposed RAR method shows compelling results on the primary benchmarks. I have a few key questions and suggestions that, if addressed, could clarify some of the paper's weaknesses and potentially improve my evaluation of its overall quality and clarity.
1. On the Validation of the "Reason Trace Evaluation"
- The evaluation of the reasoning traces in Section 4.4 relies entirely on a prompted GPT-4 auto-rater. While the prompts are transparently provided, there is no validation of this evaluation method itself. As LLM-based judgments can be unreliable without calibration, this undermines the confidence in the results presented in Table 4.
- Question: Did you perform any correlation studies between the GPT-4 scores and human judgments, even on a small subset of the data? If so, providing these correlation scores (e.g., Pearson or Spearman) would significantly strengthen this section.
- How this could change my score: Providing strong correlation data (e.g., r > 0.7) with human annotators would substantially increase my confidence in this part of your analysis, likely raising the Quality score. If no such data is available, could you elaborate on why this validation was omitted and how you ensured the reliability of the auto-rater's judgments?
2. On the Inconsistency in the Ablation Study Analysis
- There appears to be a direct contradiction between the textual analysis of the ablation study (Section 4.3) and the data in Table 3. The text suggests that the RIA module is the primary driver of all persona consistency metrics. However, the data shows that the RSO-only model (RAR w/o RIA) actually performs better on the human-side consistency metrics (attribute and behavior consistency).
- Question: Could you please clarify this discrepancy? Is the textual summary an oversimplification, or is there a more nuanced interpretation of why the RSO module appears to boost human-side persona consistency more than the RIA module does?
- How this could change my score: A convincing explanation of this phenomenon would demonstrate a deeper understanding of the interactions between your model's components. A revised, more accurate analysis of these results in the text would improve the paper's Clarity and Quality scores.
3. On the Swapped Table Captions and Reporting Clarity
- As noted in the weaknesses, the captions for Table 1 and Table 2 appear to be swapped, with Table 1 showing results for CharacterBench while being labeled SocialBench.
- Question: Could you please confirm if this is a typographical error? Additionally, to improve the paper's self-contained clarity, would you be willing to add a brief subsection in the appendix that summarizes the calculation methods for the key metrics from CharacterBench and SocialBench, even if just at a high level?
- How this could change my score: Correcting the table captions is a necessary fix for the final version. Adding a brief explanation of the metric calculations would not change my overall recommendation but would significantly improve the Clarity score of the paper, making it more accessible to readers not deeply familiar with those specific benchmarks.
Limitations
The authors have included a dedicated "Limitations" section (Appendix C) and an "Ethical Statements" section (Appendix D) that address these points.
However, a more thorough discussion would improve the paper. My suggestions are:
- Refining Technical Limitations: The current limitations section is somewhat generic (e.g., performance depends on the teacher LRM, evaluation is an ongoing research area). A more insightful discussion would connect the limitations directly to the proposed RAR architecture. For example, acknowledging that the binary distinction between "logical" and "vivid" scenarios for RSO training is a simplification and discussing how the system might handle mixed-mode conversations would be a valuable addition.
- Expanding on Potential Negative Societal Impact: The "Ethical Statements" section correctly notes that the work relies on public datasets and that base models incorporate safety measures. However, a more proactive discussion would be beneficial. The authors could elaborate on the specific risks posed by highly consistent and believable role-playing agents, such as their potential for use in creating sophisticated disinformation, enabling parasocial relationships that are emotionally manipulative, or facilitating fraud through impersonation. Discussing potential mitigation strategies beyond relying on existing safety protocols would also strengthen this section.
- Suggestion for Future Work (Dynamic Personas): The paper focuses on maintaining static character personas. A key missing point in the limitations/future work discussion is the challenge of dynamic character development. The authors could acknowledge that real human interaction leads to character evolution and that endowing their agents with the ability to change over time based on long-term memory is a significant and important next step for the field.
Final Rating Justification
Based on the authors' replies, I believe that the paper would now be a clearer read, so I increased the clarity score.
Formatting Issues
The paper generally adheres to the standard formatting guidelines. However, there is one major formatting and presentation issue that significantly impacts the clarity of the results:
- Incorrect Table Captions: The captions for Table 1 and Table 2 appear to be swapped.
  - Table 1 is captioned as showing results for SocialBench, but the metrics presented (MC, FA, AC, etc.) correspond to the CharacterBench benchmark as described in the text.
  - The text preceding Table 2 correctly identifies its contents as results from SocialBench, but the incorrect caption on Table 1 creates significant confusion for the reader.

This error should be corrected in the final version of the paper to ensure the results are presented clearly and accurately.
Thank you for the thoughtful and helpful feedback.
Q-1. The evaluation of the reasoning traces in Section 4.4 relies entirely on a prompted GPT-4 auto-rater. While the prompts are transparently provided, there is no validation of this evaluation method itself. As LLM-based judgments can be unreliable without calibration, this undermines the confidence in the results presented in Table 4.
This is a very insightful point. The reliability of LLM-based judgments is paramount and requires rigorous validation.
Following your suggestion, we asked two trained annotators to score a random subset of 50 reasoning traces from both the Distill and our RAR model, using the exact same rubrics provided to the GPT-4o auto-rater.
The results in Table 1 demonstrate a strong agreement between the average human judgments and the auto-rater's scores across all evaluation dimensions, with a high Pearson correlation coefficient (r = 0.76).
This validation significantly strengthens the claims made in Section 4.4, and we will add this study to the appendix.
| Model | Coherence | Role Relevance | Effectiveness | Conciseness |
|---|---|---|---|---|
| Distill-LLM | 2.71 | 3.54 | 3.84 | 2.06 |
| RAR-LLM | 2.86 | 3.83 | 3.92 | 1.81 |
| Distill-Human | 2.72 | 3.64 | 3.80 | 2.24 |
| RAR-Human | 2.86 | 3.70 | 3.86 | 1.94 |
Table 1: Comparison of Human Evaluation and GPT-4o Auto-Rater Scores on Reasoning Trace Quality
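For reference, a minimal sketch of the agreement computation, assuming the per-trace auto-rater scores and the averaged human scores are aligned arrays:

```python
# Sketch: Pearson correlation between auto-rater and averaged human scores.
from scipy.stats import pearsonr

def rater_agreement(auto_scores, human_scores):
    r, p_value = pearsonr(auto_scores, human_scores)
    return r, p_value
```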
Q-2: There appears to be a direct contradiction between the textual analysis of the ablation study (Section 4.3) and the data in Table 3. The text suggests that the RIA module is the primary driver of all persona consistency metrics. However, the data shows that the RSO-only model (RAR w/o RIA) actually performs better on human-side consistency metrics.
Thank you for this very sharp observation. Upon re-examination, we find that the contradiction arises from the definition of the Attribute Consistency (human) and Behavior Consistency (human) metrics. According to the explanations in CharacterBench, these metrics — unlike their bot-side counterparts — reflect human annotators' preferences to a greater extent.
The RAR w/o RIA model, while less anchored in its own persona (as shown by lower scores on bot-side metrics), generates responses that are stylistically proper (due to RSO) but more generic. This makes them more aligned with general human preferences, thus leading to higher scores on these two human-side metrics.
Your suggestion has made our work more complete, and we will revise and clarify this in the final version.
Q-3.1: The captions for Table 1 and Table 2 appear to be swapped.
Thank you for your meticulous review. We sincerely apologize for the confusion this typo has caused. We have thoroughly checked the entire manuscript to eliminate all potential typos.
The correct captions are: Table 1: Performance comparison of different methods on the CharacterBench. Table 2: Performance comparison of different methods on the SocialBench.
Q-3.2: Summarizes the calculation methods for the key metrics from CharacterBench and SocialBench.
Thank you for this suggestion.
To improve clarity, we have indeed provided detailed descriptions of both benchmarks in the submission — including their construction, evaluation methods, and metric definitions — in Appendix B (Benchmarks), which is linked from the main text on line 181.
We will make this link clearer in the final version to improve the reader experience.
Q-4: While the related work section is adequate, it could have been more comprehensive. It misses some recent and conceptually similar work in areas like cognitive architectures (e.g., MIRROR) and advanced persona control (e.g., PCL, Activation Engineering). Acknowledging these would have better positioned the work at the cutting edge of the field.
Thank you for pointing out these relevant works, which align closely with the core goals of our framework and will help us better situate our contributions within the broader research landscape.
MIRROR proposes a mechanism for conversational LLMs to conduct a cognitive inner monologue between turns, enabling persistent reflection and reasoning on the conversation history. The fine-grained persona control methods in PCL and Activation Engineering are indeed conceptually related.
We will integrate a discussion of these papers into our Related Work section, making it more comprehensive.
L-1: A more thorough discussion.
Thank you for these insightful and constructive suggestions.
These points will significantly strengthen our paper's discussion. We have revised the Limitations and Ethical Statements sections as follows:
Revisions for "Limitations":
- Refine RSO Discussion: We will clarify that the binary distinction in RSO is a principled simplification and explicitly discuss the challenge of handling mixed-mode conversations as a key area for future work.
- Introduce Dynamic Personas: We will add a dedicated discussion on Dynamic Character Development as a significant and exciting next step, framing it as a natural extension of our current framework.
Revisions for "Ethical Statements":
- Elaborate on Societal Risks: We will move beyond a general discussion to proactively detail specific risks, such as sophisticated disinformation, manipulative parasocial relationships, and impersonation for fraud.
- Propose Mitigation Strategies: We will propose concrete mitigation strategies beyond relying on base model safety, such as mandatory AI disclosure in role-playing contexts and developing classifiers to detect malicious use patterns.
Thanks for your replies to my comments. Based on your replies, I believe that the paper would now be a clearer read.
Dear Reviewer 2qUe,
Thank you very much for your careful review and positive feedback, which have helped make the paper clearer and more complete.
Paper Summary
The paper proposes Role-Aware Reasoning (RAR) to improve Role-Playing Agents (RPAs) by explicitly modeling internal character-consistent thinking. RAR consists of two key modules: Role Identity Activation (RIA), which injects character traits (e.g., emotion, motivation) to maintain role awareness, and Reasoning Style Optimization (RSO), which guides the model to generate context-appropriate reasoning styles. Experiments on CharacterBench and SocialBench show that RAR outperforms prior methods by reducing attention diversion and style drift, resulting in more consistent and believable in-character responses.
Strengths and Weaknesses
Strengths
- The paper highlights two issues that show up a lot in role-playing agents: they either lose track of their character (attention diversion) or sound too formal and generic (style drift). These problems are pretty common in practice, but not many papers tackle them directly, so I appreciate that the authors make them central to their motivation.
- The method splits reasoning into two parts—role grounding and style control—which makes the idea easier to follow. RIA keeps the model thinking in-character, and RSO helps adjust the reasoning style based on context. I think this separation is helpful, especially for understanding what each part of the method is doing.
- The experiments are solid and cover both character and social benchmarks, with some nice ablations and analysis. I found the reasoning trace evaluations and the visualizations useful—they actually help explain how the model improves, not just that it gets better scores.
Weaknesses
- Not very different from Theory of Mind work: The Role Identity Activation (RIA) module adds elements like emotion, motivation, standpoint, and experience into the reasoning process, but to me, this feels quite similar to what's already being explored in the theory of mind area (I know they are different things). It's not entirely clear how this approach is conceptually new or different.
- Still limited by teacher model quality: Since the method relies on distillation from a large reasoning model, it inherits both the strengths and weaknesses of the teacher. I get that distillation is widely used, but it does mean the performance ceiling is largely determined by how good the teacher model is—which feels like a fundamental limitation.
- Character modeling could be more detailed: The character setup includes a few fixed traits, but overall it still feels a bit shallow. For more complex or dynamic characters, this level of modeling probably isn't rich enough—there's no handling of things like shifting identities, layered personalities, or more subtle behavioral traits.
Questions
See the weakness part.
Limitations
The authors have discussed the limitations and potential negative societal impact of their work.
Final Rating Justification
I will maintain my original score.
Formatting Issues
A paragraph "This section describes the methodology..." is misplaced in the "Related Work" section.
Thank you for your valuable time and effort in the review process.
After a careful reading of the provided feedback, we believe there may have been a misunderstanding, as the review appears to discuss a different paper. The comments refer to a "multi-agent collaborative optimization framework" using the "Grey Wolf Optimizer," which does not align with the topic of our submission, "Thinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning."
We presume this is likely an accidental mix-up, which can easily happen during a busy review season. For this reason, we find ourselves unable to provide a meaningful response to the specific points raised, as they do not apply to our methodology or experiments.
We sincerely look forward to your insights on our work.
W-3: Character modeling could be more detailed: The character setup includes a few fixed traits, but overall it still feels a bit shallow. For more complex or dynamic characters, this level of modeling probably isn't rich enough—there’s no handling of things like shifting identities, layered personalities, or more subtle behavioral traits.
This point touches upon the effectiveness and generalizability of our method for future application. RIA has the following features to support depicting complex characters, making it more dynamic than a simple list of traits:
- Theoretical Foundation: (1) The RIA framework is a direct operationalization of a foundational theory of personality psychology: the Cognitive-Affective Personality System (CAPS) model [1]. The CAPS model moves beyond static traits and views personality as an organized system of "Cognitive-Affective Units" (CAUs). (2) Similarly, the 'logical' vs. 'vivid' distinction in RSO is an extension grounded in the widely accepted dual-process theory of cognition (System 1 and System 2) [5].
- Theoretical Implementation: (1) RIA simulates the CAPS model by activating a network of character-defining thoughts and feelings in response to a given context. This provides a more dynamic and psychologically plausible foundation for character consistency, inherently allowing for the expression of multi-faceted personas. (2) RSO extends this framework by identifying that role-playing requires a reasoning process that is neither purely logical nor purely intuitive—Role-Aware Reasoning. RSO enables the model to dynamically navigate between vivid and logical styles, ensuring broad applicability.
- Technical Goal: RAR aims to internalize role constraints so the model can "think in character" and "respond like character", enabling deliberate reasoning [6] in role-playing without relying on complex profile setups.
- Experimental Support: Though the profiles used during training are simple, the model performs well on benchmarks with diverse, messy profiles — showing it has learned to reason from the persona, not just follow static traits.
In summary, RAR models the foundational process of character thinking instead of directly modeling character traits (e.g., personality, knowledge). This allows it to effectively handle complex characters in benchmarks and even real-world scenarios.
References
[1] A cognitive-affective system theory of personality: Reconceptualizing situations, dispositions, dynamics, and invariance in personality structure.
[2] Reasoning Does Not Necessarily Improve Role-Playing Ability.
[3] Meta‑Prompting: Enhancing Language Models with Task‑Agnostic Scaffolding.
[4] Concept-Based Rubrics Improve LLM Formative Assessment and Data Synthesis.
[5] From system 1 to system 2: A survey of reasoning large language models.
[6] Deliberative Alignment: Reasoning Enables Safer Language Models
Dear reviewer,
Now that the authors have provided their rebuttal, do you have any additional thoughts?
Could you also provide your updated detailed ratings (Quality, Clarity, Significance, Originality)?
Thanks, AC
Dear AC,
I have carefully read the authors' rebuttal and plan to maintain my original score.
Dear reviewer ob7m,
We greatly appreciate your updated review and your hard work during the review season. These discussions encouraged us to explain the theoretical basis of our method in more depth. If there are any remaining concerns or areas for improvement, we would greatly appreciate it if you could point them out, and we will make every effort to address them thoroughly.
Thank you for your efforts in providing the updated thoughtful review.
W-1: Not very different from Theory of Mind work: The Role Identity Activation (RIA) module adds elements like emotion, motivation, standpoint, and experience into the reasoning process, but to me, this feels quite similar to what's already being explored in theory of mind area (I know they are different things). It's not entirely clear how this approach is conceptually new or different.
This is an excellent point. As you noted, our approach is deeply connected to Theory of Mind (ToM), which shows that our method is theoretically robust and extensible:
- Theoretical Foundation: The RIA framework is a direct operationalization of a foundational theory of personality psychology: the Cognitive-Affective Personality System (CAPS) [1]. The CAPS model moves beyond static traits (like the Big Five) and views personality as an organized system of "Cognitive-Affective Units" (CAUs).
- Theoretical Implementation: RIA is a direct operationalization of these CAUs: Emotion maps to Affects; Experience and Standpoint map to Encodings and Beliefs; Motivation maps to Goals and Values. RIA simulates the CAPS model by activating a network of character-defining thoughts and feelings in response to a given context. This provides a more dynamic and psychologically plausible foundation for character consistency, inherently allowing for the expression of multi-faceted personas.
- Uniqueness: Previous research often focuses on directly modeling various traits in character profiles. Since the training set cannot cover all character information, it will inevitably face generalization problems. Our method, starting from ToM and combined with LRM technology, allows the model to spontaneously learn to think from the character's perspective, i.e., "think in character."
W-2: Still limited by teacher model quality: Since the method relies on distillation from a large reasoning model, it inherits both the strengths and weaknesses of the teacher. I get that distillation is widely used, but it does mean the performance ceiling is largely determined by how good the teacher model is—which feels like a fundamental limitation.
This is an important insight. Our method has the following features to handle the fundamental limitation of distillation:
- Common LRM Failures: Previous studies [2] have shown that even powerful reasoning models like GPT-o1 and Deepseek-R1 face issues of attention diversion and style drift in role-playing.
- Similar Work: Prior work has shown that even imperfect LLMs can produce reliable outputs when guided by carefully designed prompts that scaffold the reasoning process [3,4]. Therefore, it is feasible to elicit abilities from flawed LRMs that they do not originally possess.
- Methodology Principle: The LRM is an imperfect role-player, but it can produce high-quality role-aware reasoning when guided by our methods. In this process, the RIA and RSO prompts act as scaffolds. Then, distillation transfers this prompt-guided behavior (rather than the LRM's flawed behavior) to the student model.
- Quality Verification: To verify this, we have conducted several rounds of manual evaluation in the experiment. In a final check on 100 sampled training instances, three annotators answered some quality-control questions. The results in Table 1 show that the vast majority of the data conforms to the instructions.
| Questions | Yes% |
|---|---|
| Is the response consistent with the character's style? | 90% |
| ...'s knowledge scope? | 88% |
| Does the reasoning process include the character's standpoint? | 93% |
| ...'s motivation? | 90% |
| ...'s experience? | 89% |
| ...'s emotion? | 87% |
| Does the positive reasoning example match the scenario's style? | 85% |
| ...contrast to the scenario? | 92% |
Table 1: Manual quality assessment.
In summary, our method elicits role-reasoning capabilities that the teacher model is not good at, thereby improving the quality of the distillation data.
This paper introduces a Role-Aware Reasoning (RAR) method aiming to address the issues of attention diversion and style drift in Role-Playing Agents (RPAs) when using Large Reasoning Models (LRMs). Traditional RPAs lack deep, human-like internal thought processes, and direct application of LRMs leads to problems such as the model forgetting its role or generating overly formal reasoning.
Strengths and Weaknesses
Strengths:
- Clearly targets the two core challenges (attention diversion and style drift) in the application of LRMs in RPAs, and proposes a structured solution with practical application value.
- The design of RIA and RSO combines character feature activation and scene-based style optimization, achieving end-to-end character-consistent reasoning through distillation and contrastive learning. The method is novel and logically self-consistent.
- Evaluates using multiple authoritative benchmarks (CharacterBench, SocialBench), compares with multiple baseline methods, and validates the effectiveness of the modules through ablation experiments and case studies, making the results highly reliable.
Weaknesses:
- RAR is implemented through LRM distillation, but the computational cost of the reasoning process (such as the efficiency of long-sequence generation) is not discussed. It may face performance challenges in practical deployment.
- RIA relies on automatically extracted core character elements (emotions, motivations, etc.), which may not be able to depict complex characters (such as multi-faceted personalities) in sufficient detail and may require manual intervention for optimization.
- The predefined scenario types (logical analysis, vivid interaction) in RSO are relatively basic, and there is a lack of verification of adaptability to more complex situations (such as cross-cultural social interactions, emotional conflicts).
Questions
- Relationship between reasoning efficiency and model scale: The paper uses LLaMA-3-8B as the base model. If it is extended to larger-scale models (e.g., 70B parameters), will the reasoning efficiency of RAR significantly decline? Are there any optimization strategies for model compression or parallel reasoning? It is recommended to supplement performance comparison experiments under different model scales or explain the adaptability of the method in efficient reasoning scenarios.
- Ability to support dynamic character development: Does the existing RAR support the dynamic growth of characters in multi-turn dialogues (such as a character evolving from a "shy teenager" to a "confident hero")? For example, if a character's personality changes with experience, how does RIA update the character features? It is suggested to explain the support mechanism of the method for dynamic character modeling or supplement relevant experiments in future work.
- Feasibility of cross-language character adaptation: The experimental data are mainly based on English-speaking characters (such as 95 English-speaking characters in RoleBench). When applied to Chinese or other language characters, does the prompt engineering and style optimization of RAR need to be adjusted? Are there any preliminary verification results in cross-language scenarios? It is recommended to supplement the analysis or experiments of cross-language adaptation to enhance the universality of the method.
Limitations
yes
Final Rating Justification
4
Formatting Issues
no
Thank you for the detailed and constructive review.
W-1: RAR is implemented through LRM distillation, but the computational cost of the reasoning process (such as the efficiency of long-sequence generation) is not discussed.
The inference efficiency of LRMs is a general challenge in large-model inference, and it has been widely discussed and mitigated by modern inference engines such as vLLM, which employ highly effective optimizations like PagedAttention.
Our method also benefits from these improvements. To verify this, we tested the inference speed of our models with vLLM (v0.8.5.post1, enable_prefix_caching=True).
| Model | Average Inference Time per Sample (s) |
|---|---|
| Vanilla (8B) | 0.19 |
| Distill (8B) | 0.44 |
| RAR (8B) | 0.48 |
Table 1: Inference latency on 4×H20 GPUs.
As shown in Table 1, the additional latency is marginal, especially when weighed against the substantial improvements in role-playing. We will add this analysis to the final version.
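For reference, a minimal sketch of how such per-sample latency can be measured with vLLM's offline engine; the model path, sampling settings, and tensor_parallel_size value are illustrative assumptions.

```python
# Sketch: average per-sample latency with vLLM (batched offline inference).
import time
from vllm import LLM, SamplingParams

def avg_latency_per_sample(model_path, prompts):
    llm = LLM(model=model_path, tensor_parallel_size=4, enable_prefix_caching=True)
    params = SamplingParams(temperature=0.7, max_tokens=1024)
    start = time.perf_counter()
    llm.generate(prompts, params)  # generate all samples in one batched call
    return (time.perf_counter() - start) / len(prompts)
```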
W-2: RIA relies on automatically extracted core character elements, which may not be able to depict complex characters (such as multi-faceted personalities) in sufficient detail and may require manual intervention for optimization.
RIA has the following features to support depicting complex characters:
- Theoretical Foundation: The RIA framework is a direct operationalization of a foundational theory of personality psychology: the Cognitive-Affective Personality System (CAPS) [1]. The CAPS model moves beyond static traits (like the Big Five) and views personality as an organized system of "Cognitive-Affective Units" (CAUs).
- Theoretical Implementation: RIA is a direct operationalization of these CAUs: Emotion maps to Affects; Experience and Standpoint map to Encodings and Beliefs; Motivation maps to Goals and Values. RIA simulates the CAPS model by activating a network of character-defining thoughts and feelings in response to a given context. This provides a more dynamic and psychologically plausible foundation for character consistency, inherently allowing for the expression of multi-faceted personas.
- Robust Construction: The ablation study (Figure 2 in the submission) shows that the model's performance is not overly reliant on any single component, suggesting resilience in capturing complex personalities.
In summary, RIA is capable of depicting complex characters without manual intervention.
W-3.1: The predefined scenario types in RSO are relatively basic.
As you noted, the binary distinction between "logical" and "vivid" scenarios is basic, which also means our method is simple and effective.
- Theoretical Foundation: This framework is grounded in the well-established dual-process theory of human cognition ("System 1" for intuition and "System 2" for logical thinking) [2], a paradigm adopted by numerous recent works [3,4].
- Theoretical Implementation: RSO extends this by identifying that role-playing requires a reasoning process that is neither purely logical nor purely intuitive—a process we term Role-Aware Reasoning. It endows the model with the ability to dynamically navigate between vivid and logical styles.
Therefore, this foundational nature is an advantage of our method, making it easier for subsequent researchers to follow our work.
W-3.2: There is a lack of verification of adaptability to more complex situations.
The benchmarks we used already contain many complex situations:
- Data Source: The data for CharacterBench and SocialBench come from high-quality, complex multi-turn dialogue data, including novels, scripts, and real human-computer role-playing conversations.
- Data Annotation: These benchmarks have been verified through multiple rounds of automatic and manual validation to ensure they have sufficient difficulty and complexity.
- For example: SocialBench includes emotional conflicts such as humor, sarcasm, and group conflict, while many characters in CharacterBench come from diverse cultural backgrounds.
The results on these benchmarks demonstrate that the learned stylistic control of RSO generalizes effectively beyond the two basic types.
Q-1.1: Relationship between reasoning efficiency and model scale: The paper uses LLaMA-3-8B as the base model. If it is extended to larger-scale models (e.g., 70B parameters), will the reasoning efficiency of RAR significantly decline?
This is an excellent question regarding scalability.
Following your suggestion, we conducted an additional experiment comparing the 8B and 70B model scales.
As shown in Table 2, the latency of RAR is approximately 2.5 times that of Vanilla for both model sizes. This indicates that the primary overhead of RAR is the generation of additional tokens for the reasoning trace.
Therefore, our method demonstrates adaptability in efficient reasoning scenarios.
| Model | Avg. Time (8B) | Avg. Time (70B) | Relative Overhead |
|---|---|---|---|
| Vanilla | 0.19s | 1.52s | - |
| RAR | 0.48s | 3.89s | ~2.5x |
Table 2: Inference latency comparison across model scales.
Q-1.2: Are there any optimization strategies for model compression or parallel reasoning?
We have already utilized advanced inference engines like vLLM with model compression (bfloat16) and parallel reasoning, achieving the superior performance shown in our responses to W-1 and Q-1.1. Without them, the inference performance would be significantly lower:
| Model | Avg. Time (70B) | w/o bfloat16 | w/o parallel reasoning |
|---|---|---|---|
| RAR | 3.89s | 7.66s | 16.78s |
Moreover, RAR is fully compatible with advanced optimization techniques like quantization and speculative decoding, which can further improve efficiency for large-scale deployment.
Q-2.1: Does the existing RAR support the dynamic growth of characters in multi-turn dialogues?
In its current implementation, our method primarily focuses on static character profiles. For dynamic profiles, a method needs to be character-agnostic and context-driven. Our method happens to possess these characteristics:
- Character-agnostic: As discussed in our response to W-2, RIA is based on the CAPS model, which activates personality facets based on the situation.
- Context-driven: RSO is explicitly designed for dialogue scenarios. This means that even if a character's profile evolves, our method's core machinery remains directly applicable.
Therefore, our method has the potential to support the dynamic growth of characters in multi-turn dialogues.
Q-2.2: How does RIA update the character features?
To support character growth, the RIA framework could be extended. We envision a mechanism where the character profile is treated as a dynamic memory store. After significant interactions, a meta-controller could be used to summarize the new experiences and update the profile. This updated profile would then be fed into the model for subsequent interactions.
Since this does not affect the functioning of RIA and RSO, it would allow the character's "thought process" to evolve over time.
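Purely as an illustration of this envisioned extension (not an implemented component), the profile could be kept in a mutable store and refreshed by a summarization call after each session; `summarize_update` below is a hypothetical stand-in for such an LLM call.

```python
# Illustrative sketch of a dynamic profile store for character growth (hypothetical design).

class DynamicProfile:
    def __init__(self, initial_profile):
        self.profile = initial_profile

    def update(self, dialogue_history, summarize_update):
        # Fold significant new experiences from the session into the stored profile.
        self.profile = summarize_update(self.profile, dialogue_history)

    def as_prompt(self):
        # The updated profile feeds RIA unchanged in subsequent interactions.
        return self.profile
```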
Q-3.1: Does RAR require adjustment for non-English characters?
In fact, our method is language-agnostic:
- Existing studies have demonstrated that there is no fundamental difference in the basic reasoning mechanisms of LRMs across languages [5].
- As a reasoning method, RAR is also robustly transferable across languages because it operates on the fundamental, language-agnostic level of reasoning.
- RAR addresses the core challenges of attention diversion and style drift, which are not specific to English.
Therefore, when applied to Chinese or other language characters, RAR's prompt engineering and style optimization do not need adjustment.
Q-3.2: Are there any preliminary cross-lingual results?
This is an interesting suggestion. Following this suggestion, we evaluated our model on the Chinese subset of CharacterBench without any language-specific adjustments to our prompts or methodology.
| Method | – | – | – | – | – | – | – | – | – | – | – | – | – | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla | 2.24 | 1.88 | 3.31 | 2.55 | 2.23 | 2.33 | 2.39 | 2.23 | 2.26 | 4.54 | 4.65 | 1.74 | 1.83 | 2.63 |
| Distill | 2.50 | 2.13 | 3.28 | 2.90 | 2.82 | 2.60 | 2.79 | 2.50 | 2.46 | 4.96 | 4.77 | 1.82 | 1.90 | 2.88 |
| RAR | 2.67 | 2.33 | 3.45 | 2.86 | 2.78 | 2.65 | 3.01 | 2.55 | 2.52 | 4.77 | 4.81 | 1.81 | 1.93 | 2.93 |
Table 3: Performance on the Chinese subset of CharacterBench.
As shown in Table 3, RAR consistently outperforms the Vanilla and Distill baselines, achieving the highest average score. This strong performance confirms that the principles of RIA and RSO generalize effectively to non-English settings.
In summary, our method demonstrates substantial cross-lingual robustness.
References
[1] Mischel, W., & Shoda, Y. (1995). A cognitive-affective system theory of personality: Reconceptualizing situations, dispositions, dynamics, and invariance in personality structure. Psychological Review, 102(2), 246–268.
[2] Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
[3] Yu, P., et al. (2024). Distilling system 2 into system 1.
[4] Li, Z., et al. (2025). From system 1 to system 2: A survey of reasoning large language models.
[5] Hu, P., et al. (2025). Large Language Models Are Cross-Lingual Knowledge-Free Reasoners.
Dear reviewer,
Now that the authors have provided their rebuttal, do you have any additional thoughts?
Thanks, AC
Dear Reviewer cYs6,
We hope this message finds you well. We are writing to kindly follow up on our recent rebuttal. We sincerely appreciate the time and effort you have dedicated to reviewing our work and would be very grateful for any additional thoughts you may have.
In our previous response, we addressed all of your concerns regarding the computational cost and reasoning efficiency (W1, Q1), the capability to depict complex characters (W2), the adaptability to complex situations and the theoretical foundation of our scenario design (W3), the potential for supporting dynamic character development (Q2), and the cross-language generalizability of our method (Q3). For each of these points, we provided detailed explanations, theoretical justifications, and supplementary experimental results.
Additionally, our discussions with other reviewers and the resulting clarifications may further help address your concerns and highlight the contributions of our work:
- Our responses to Reviewers ob7m and JQVU further elaborate on the theoretical foundations of our method, strengthening our claims about handling complex characters and diverse scenarios (W2, W3).
- In response to Reviewer JQVU, we conducted new experiments demonstrating that simple prompt engineering without distillation is insufficient (W3). We also provided quantitative analysis (Silhouette Score) to validate the effectiveness of RSO's style separation (Q3). These results underscore the necessity and effectiveness of our proposed distillation framework.
- To address questions from Reviewer JQVU about fairness and generalization, we performed additional experiments on a different base model (Qwen3-14B) and compared our data generation approach with other SOTA public datasets. The results confirm that our method is robust, generalizable, and produces high-quality synthetic data.
- Following the suggestion from Reviewer 2qUe, we provided a human-correlation study for our GPT-4 based evaluation of reasoning traces, which validates the reliability of our qualitative analysis and strengthens our conclusions (Q1).
If there are any other remaining concerns or areas for improvement, we sincerely welcome you to point them out and we will make every effort to address them thoroughly.
Thank you again for your support!
With sincere appreciation, All Authors
Thank you for the response.
This paper introduces Role-Aware Reasoning, a novel method for improving Role-Playing Agents by enabling more human-like internal thought processes. RAR addresses two key issues in RPAs: attention diversion, where agents forget their role, and style drift, where their reasoning becomes overly formal. The method consists of two stages: Role Identity Activation, which guides the model with character profiles, and Reasoning Style Optimization, which aligns reasoning style with the character and scene. The reviewers appreciate the novelty of the solution, the clarity and structure of the paper and the method, as well as the robust evaluation.
The reviewers also raise several concerns, however, such as the computational cost, the limited and relatively basic scenarios in RSO, the similarity of the proposed method to Theory of Mind work, limitations in performance, and some presentation issues. The authors provide comprehensive rebuttals, addressing most of the concerns; however, most reviewers do not increase their scores as a result.