Values in the Wild: Discovering and Mapping Values in Real-World Language Model Interactions
Our privacy-preserving analysis of values in real-world language model interactions reveals a novel taxonomy of AI values that differs from human frameworks, is highly context-dependent, and becomes most explicit/legible during moments of resistance.
Abstract
Review and Discussion
This work undertakes a large-scale (700k-conversation) analysis of real-world user conversations with the Claude assistant on Claude.ai from Feb 18-25, 2025. The authors developed several prompt-based classifiers for human values, AI values, and response types. They then built a four-level taxonomy of AI values and analyzed the relationships between AI values and (a) task types and (b) human values.
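For concreteness, here is a minimal sketch of what one such prompt-based classifier could look like; the prompt wording, output format, and model name are illustrative assumptions rather than the authors' actual setup.

```python
import json

import anthropic  # assumes the Anthropic Python SDK and an API key are configured

# Illustrative prompt; not the authors' actual extraction prompt.
VALUE_EXTRACTION_PROMPT = """You will be shown one AI assistant response from a conversation.
List the values the assistant expresses or demonstrates in this response.
Return only a JSON array of short value names, e.g. ["transparency", "user wellbeing"].

Assistant response:
{response_text}
"""

client = anthropic.Anthropic()


def extract_values(response_text: str) -> list[str]:
    """Prompt a model to label the values expressed in a single assistant turn."""
    reply = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model choice
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": VALUE_EXTRACTION_PROMPT.format(response_text=response_text),
        }],
    )
    return json.loads(reply.content[0].text)
```

Analogous prompts would cover human values and response types, with the extracted labels then deduplicated and clustered into the four-level taxonomy.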
Reasons to Accept
- Provides a rich taxonomy of AI values derived from real-world interactions between users and Claude models
- Detailed analysis of interesting and meaningful questions regarding the factors behind AI value expression (task types, and human values across different response types)
- Well written, with good illustrations and a rich appendix covering many details
Reasons to Reject
- (Important) AI values: assistant role vs. model itself: The paper examines values exhibited by AI assistants, but doesn't sufficiently distinguish between values inherent to the model and those that emerge from the assistant framing. The dominant "professional/intellectual" values observed (line 190) may simply result from users treating the system as a tool/assistant rather than representing the model's inherent values. It would strengthen the paper to analyze model value preferences without the assistant framing to establish baselines and determine how much the interaction context influences value expression.
- Suggestion: Design a user study where participants interact with systems framed as either AI or non-AI entities speaking freely
- Suggestion 2: if there are existing resources on values expressed in pure AI generations, perhaps they could be compared against as a baseline?
- Unless a strong correlation between assistant values and model values can be demonstrated, consider renaming "AI values" to "AI assistant values" for clarity
- (Important) The authors missed relevant recent work on evaluating the value preferences of LMs (published at least six months ago):
- DailyDilemmas -- https://arxiv.org/abs/2410.02683
- Do LLMs have consistent values? -- https://arxiv.org/abs/2407.12878
- (Important) Lack of annotator details and inter-rater agreement. For the scores reported in A.4, how many annotators were there? Who are they? If there was more than one, can you provide inter-rater agreement?
Questions to Authors
- Principles vs. values: I like the idea of relating values to existing principles like the HHH framework (Line 177-). I am curious whether there are insights on (1) how closely the AI values relate to HHH and how they relate to human values, and (2) the relationship between other established principles (e.g., Claude's Constitution) and the values.
- Generalizability and recommended actions for the community: how confident are the authors that the findings apply to AI assistants from other model developers? (Could you explain a bit more about the WildChat analysis, if it is meant to demonstrate this point?)
- (Less important) Mirroring and sycophancy: do you think these are similar behaviors, and are there any insights here?
(A bonus point; hoping for more insights for the community) Re: point 1, another implication concerns the system prompts/fine-tuning used to steer AI models into the assistant role. I am not sure what kind of experiment/analysis/insights could be disclosed, but the research community will be curious to know. For instance, how would base models exhibit these values? (I think the suggested experiment could at least help analyze "purely" instruction-tuned models.)
Would love to see the taxonomy, and I hope it can be shared publicly to accelerate value-based research! Happy to raise my score if the authors can resolve the weaknesses (especially the important ones).
Thank you for your close reading of our paper! On your “reasons to reject”:
- Assistant role vs. model itself: It is important to understand the role of the AI assistant framing, and we do mention this in the Results section. Two thoughts on the suggestions:
a) It would be difficult to analyze model value preferences without the assistant framing, since it is a lot of work to try to create an apples-to-apples comparison here (e.g. it doesn’t seem like there is existing data on this, so to collect this data you would have to fine-tune a model to instruction-follow in exactly the same ways, just without the assistant framing, which is quite expensive). Furthermore, a user study would be very interesting, but is potentially more suited to being a separate piece of work, given we were already finding it hard to fit all our work into the existing page limit for this paper.
b) We thus decided to just scope our analysis to the AI assistant since this is the dominant interaction paradigm, and although we might be conflating model vs. assistant-role values, the assistant role is generally what people experience.
- Thanks for noting works we had not cited; we will be very glad to include them in our related work section.
- On the inter-rater agreement question: We had six independent annotators review our feature extraction outputs, dividing the outputs among all raters such that no output was rated by more than one person. We felt that the high accuracy bounded the possible inter-annotator agreement in some sense, and so did not think it necessary to add more work for our raters, since they were already rating a large number of conversations across five different features.
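For reference, agreement on double-rated outputs could be summarized with a chance-corrected statistic such as Cohen's kappa; below is a minimal sketch with hypothetical labels, not data from our study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary judgments ("is the extracted feature supported by the
# conversation?") from two raters reviewing the same eight outputs.
rater_a = [1, 1, 0, 1, 1, 1, 0, 1]
rater_b = [1, 1, 0, 1, 0, 1, 0, 1]

print(cohen_kappa_score(rater_a, rater_b))  # 1.0 = perfect agreement, 0.0 = chance level
```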
On your other questions:
- Principles vs. values: Our taxonomy shows that HHH manifests through specific, contextual values. "Helpfulness" (our most common value) breaks down into granular values like "accessibility" and "user enablement." While we could map all values to HHH principles, the mapping is inherently fuzzy (e.g., "patient wellbeing" serves both helpfulness and harmlessness). Regarding Claude's Constitution, we could compare our values to the constitutional principles published online; however, many of the principles overlap with one another, and for these reasons it was not clear to us how useful it would be to measure correspondence to them. So far, there do not seem to be robust, established value frameworks for LLM behavior that we could have compared against, but we hope that our work can contribute to the creation of one.
- Generalizability: While we focus on Claude, we believe our methodology should generalize to other AI assistants. The WildChat analysis (Appendix A.5) demonstrates that our feature extraction works on other models' conversations. We encourage other researchers to apply this framework to different models to build a more comprehensive understanding of value expression across AI systems.
- Mirroring and sycophancy: Value mirroring shares characteristics with sycophancy but differs in important ways: Claude mirrors selectively (20% during support vs. 1% during resistance), suggesting nuanced behavior that reflects user values when appropriate but diverges when problematic. Further research could explore whether this represents appropriate responsiveness or concerning sycophancy by digging into specific contexts.
Thanks to the authors for the detailed response! Re: assistant role vs. AI values themselves, I would recommend the authors address this much earlier in the paper so that readers understand the difference between the two, and clarify that the AI values here refer to values elicited from an AI chatbot that users assume is an assistant.
This paper presents a large-scale empirical study of values expressed by AI assistants during real-world interactions, focusing on 700,000 conversations with Claude. The authors use a combination of prompting, clustering, and statistical analysis to identify and organize over 3,000 unique values into a hierarchical taxonomy. The work is original in scope and introduces a novel empirical framework for mapping AI values "in the wild."
Reasons to Accept
- Ambitious scale and dataset: The authors analyze a vast corpus of real-world interactions (700k conversations), which is rare in LLM values research.
- Novel empirical approach: Rather than rely solely on static benchmarks or surveys, the study extracts values directly from AI behavior in deployment.
- Taxonomy construction: The hierarchical clustering of 3,308 values offers a new way to concretely examine how LLMs manifest normative behaviors.
- Interesting and practically useful findings: It is interesting to see empirical confirmation of intuitive expectations. For instance, value expression is context-dependent and varies by task and user values (e.g., Claude expresses "healthy boundaries" predominantly in relationship advice tasks and "human agency" in AI ethics discussions).
Reasons to Reject
- Overwhelming number of values: 3,308 AI values is such a large set that I worry it is hard to assess how correct, meaningful, and distinct these concepts are. I wish more effort had been put into randomly spot-checking and quantifying the 99% of values that were not described in the results section.
- Lack of methodological transparency: The use of Claude to extract values from Claude conversations raises serious concerns about bias and circularity. What would happen if another LLM were used to extract or cluster the same data?
- Missing robustness checks: The taxonomy's stability is not validated. A second clustering or use of different models (non-Claude) could yield significantly different results—this possibility is not explored.
Questions to Authors
- Can you report the number of values per level in the taxonomy in Figure 2?
- How stable is the taxonomy across different clustering methods or embeddings?
- How would results change if GPT-4 or another model were used for value extraction and clustering?
- Could you include something like a Table 1 describing the input dataset, with word counts, turn counts, or other dataset metadata (e.g., distribution of question types)?
- How were the 700k Claude conversations obtained and filtered? Is this something other researchers could get access to, or is it a special arrangement with Anthropic? This question may be moot once the paper is de-anonymized, but in the event that it is not, discussing such sourcing could be valuable.
Thank you for closely reading our paper. On your questions:
- We are happy to look at changing the figure to report the number of categories per level.
- On the taxonomy stability: Because we prompt a model to generate cluster names and descriptions, we had to iterate a few times to ensure the result made sense. Initially, the cluster names were overly domain-specific, resulting in high-level clusters that looked like "Professional and academic integrity standards" and "Professional governance and operational excellence." This was too hard to interpret at a glance, so we had to steer the cluster-naming model towards more theoretical and abstract cluster names (by prompting it to do so). Once we had that set, the taxonomy was reasonably stable. We tried clustering three different times with slightly different k-means parameters, but we didn't try different embedding models, since this didn't seem necessary (see the sketch after this list for the kind of embedding-and-clustering step involved). That said, it is true that depending on what organizational framework you prefer, you can organize the values differently. We don't think there is one right way to cluster these values; there are probably a number of different "local minima" here. Values are so multifaceted that it's probably impossible to taxonomize them perfectly. Part of the reason we are releasing the 3,308 values is to encourage and allow others to come up with different taxonomies for different applications.
- We are not sure; we hope there will be follow-up work that replicates this on other models.
- Because this is based on user data we’re unfortunately limited in the additional pieces of information we’re able to share, especially what is not directly relevant to the question of values (e.g. turn count).
- We hope to have a discussion about this when the paper is non-blinded!
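To make the clustering step above concrete, here is a minimal sketch of the embedding-and-k-means stage; the embedding model, number of clusters, and example values are placeholders rather than our exact configuration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# A handful of values for illustration; the real input is the full set of 3,308 values.
values = [
    "accessibility", "user enablement", "patient wellbeing",
    "healthy boundaries", "human agency", "transparency",
]

# Placeholder embedding model; any sentence-embedding model could be substituted.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(values, normalize_embeddings=True)

# One level of clustering; rerunning with different k or random seeds is a cheap
# way to probe how stable the resulting groupings are.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)
for value, label in zip(values, kmeans.labels_):
    print(label, value)
```

A cluster-naming prompt is then applied to each group, which is the step where we had to steer the model towards more abstract names.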
Sounds good! (I'm getting email notifications that I should reply to author comments, hence this note)
The proposed work puts forth a framework for analyzing value expression in language model conversations.
They also present the first large scale empirical taxonomy of values from real-world interactions with LLM chatbots. Subsequent findings, such as the dominance of professional and intellectual values, will be quite impactful in shaping the landscape of language model interaction research.
The finding that human values display a single highly dominant value followed by a long tail of diverse values, as opposed to the somewhat consistent concentration of value expression in the AI assistant, is quite interesting. It raises the question of whether context- or domain-specific uses of LLMs modify this trend, as opposed to averaging over many in-the-wild interactions.
Overall, the presented work does an excellent job of developing a hierarchical taxonomy (presented in Figure 2) with various clusters demonstrating the breadth and frequency of different values that were expressed through a large sample of conversations. This type of work will be extremely helpful for the development of more contextually sound alignment methods, datasets, and evaluation paradigms which can only serve to increase the general usability of LLM chatbots.
Reasons to Accept
- the human verification of value extraction has extremely high accuracy (98.8%), which implies the robustness of the proposed method
- studying associations between AI values and human values is extremely valuable for future research
- Figures 3a and 3b are well illustrated to demonstrate various value associations
Reasons to Reject
- Experiments are only performed on the Claude family of models; it is not clear how well this methodology would transfer to other model families.
Questions to Authors
- In Section 2.1, it is mentioned that around 44% of the initial 700k conversations remained after filtering. What are some examples of conversations that were not deemed ones "where value expressions were most likely to emerge"? It would be nice to include an example or two of these (perhaps in the Appendix), just for illustrative purposes.
Thank you for your detailed attention to our paper. To respond to your question: we are happy to include an example or two of the non-subjective conversation topics that were removed, in addition to the examples already in the Appendix, where the prompt's few-shot examples show the kinds of conversations we remove: e.g., factual questions ("Is this SQL query syntactically correct?", "Do I need a visa for Japan?") or requests based mostly on established knowledge (e.g., debugging a function or creating a financial report).
Thank you. I have read the official comment by the authors and will retain my score.
The paper presents a large scale empirical study of values expressed in Claude responses. The reviewers appreciated the scale of the study and found the results interesting. They had multiple great comments which the authors should take into account when preparing the final version.