PaperHub
4.8
/10
Rejected4 位审稿人
最低3最高6标准差1.1
6
3
5
5
3.3
置信度
正确性2.3
贡献度2.3
表达2.3
ICLR 2025

CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants

OpenReviewPDF
提交: 2024-09-28更新: 2025-02-05
TL;DR

We introduce CURATe, a multi-turn benchmark for assessing a dialogue agent's ability to pay attention to, interpret, and appropriately align its behaviour to user-specific constraints, risks and sensitivities in ongoing conversations.

摘要

关键词
LLMsafety and alignmentagentic AIpersonalised alignmentcontext-sensitiverecommender systemsbenchmarkmulti-turn evaluationdialogue assistants

评审与讨论

审稿意见
6

This paper presents a benchmark called Curate, designed for evaluating personalized alignment for AI assistants focusing on user-provided safety-related information (e.g. allergy, phobia) multi-turn conversations. Authors evaluated six models e.g. GPT-series, Llama-3, Llama-3.1 on five scenarios (e.g. no added persons, one person with conflicting preferences in conversation). They found that the models that trained by the “harmless” paradigm fail to consider the harmful elements (based on users provided content) and hence give inappropriate recommendations. For example, models fail to weigh the conflicting preferences of different users and models prefer to give recommendations that align with soft user preferences over safety. They also propose some future research directions for model alignment, such as context-aware approaches.

优点

  1. Constructed an interesting benchmark that addresses important problems (preference vs safety) on AI alignment through multi-turn conversations.
  2. Good designs on experiments (changing the number of persons) and the ablation studies (changing the prompts).
  3. Provide detailed discussion on the current alignment strategies (RLHF)

缺点

  1. Not enough description on benchmark construction and statistics, which reduces the usability of the benchmark. I can only find some parts of the benchmark design process in the Appendix. It is important to have information such as: (1) the prompts used to generate such benchmark (by few-shot prompting), (2) distribution of the 337 use cases on the four categories (trauma triggers…), (3) why the choice to have 337 use case and four categories (4) descriptive statistics: how many turns of conversations etc
  2. Linkage between experiment and discussion is weak. It is hard to understand why the findings from experiments and ablation can inspire the discussion (Section 5). For instance (line 462-463) , it’s confusing how the authors make logical steps between our findings (which findings?) → reveal the sycophancy → non-safety-critical preference (I thought the benchmark only has safety-related categories?) → systematic bias (missing example?)
  3. No human validation on the evaluation metrics by LLM-as-judge. It is known that the model being used to evaluate favors the responses generated by itself [1]. Authors used the Llama-3.1-405 instruct as the judge. The findings of model performances (specifically the better performance by Llama-3.1-405 instruct) is not convincing without validating if the model judge gives reliable scores.
  4. Some claims are overly strong (e.g. line 467: the very notion of ‘harmlessness’ in the HH framework is fundamentally flawed.). I can see the importance of considering the context on aligning harmlessness but it does not sound fair to say the notion is fundamentally wrong. This discussion could be better grounded with some current works on reward models/RLHF-alternatives (e.g. HelpSteer2 [2], SteerLM [3] – consider the graded rewards with contexts) and value alignments works (DailyDilemma [4], Pluralistic alignment [5])

[1] https://arxiv.org/pdf/2404.13076

[2] https://arxiv.org/pdf/2406.08673

[3] https://arxiv.org/pdf/2310.05344

[4] https://arxiv.org/pdf/2410.02683

[5] https://arxiv.org/pdf/2402.05070

问题

Questions:

  1. The passing rate by line 252 is confusing: do you mean only score 2 will be classified as passing or both scores (1 and 2) will be classified as passing?
  2. The focus on "personalization" sounds more closely related to the soft preferences (e.g. style) and may dilute the contribution of this work. Plus, the benchmarks (based on the four categories names in Figure 13 e.g. Trauma triggers) is more about health and safety. Maybe consider renaming the title/reframe the paper to sharpen the health-related safety aspect?

Minor:

  1. Many of the citations in the main paper have wrong formatting. ABC et al. (2024) → (ABC et al., 2024)
  2. The legends in figure 2 and figure 3 are confusing. Both are named as Scenarios but one is the variation of benchmarks scenarios and another one is the variation of ablation.
  3. Appendix p.16 does not contain the Example Completions. Is it due to the formatting issue?
评论

We thank the reviewer for their thorough (and very helpful) feedback. We have addressed their concerns to the best of our ability in the latest revised version, and our responses to each weakness/question are given below.

  1. We addressed the reviewer’s request for more information on the benchmark construction process. We have included a breakdown of how each element was introduced, offering an overview of the process and criteria we used in the main text, and going into detail about how models were iteratively prompted to help us design individual benchmark elements in small batches. We appreciate this comment as it helps us show the manual effort and careful reflections that actually went into each aspect of CURATe’s design. We have also highlighted the answers to point (2) and (3) more clearly in section 3 and the appendix: that use cases from each category (85 each) were balanced to make up the total of 337 per scenario, and how/why we arrived at that amount. We have also included a diagram that gives a more detailed breakdown of the number and content of conversation turns for each scenario.

  2. We have rephrased this statement somewhat for clarity, and included an example. The benchmark evaluates a model’s ability to account for safety-critical information that the user shared in the context of an ongoing conversation. However, performance worsened across models as another actor with conflicting (non-safety critical) preferences was introduced (e.g., really liking something related to the joint activity recommendation, which should be irrelevant if it puts the user in physical danger). We describe this as a form of sycophancy, as models universally prioritise being agreeable to people’s desires rather than sensitive to safety-critical constraints when making recommendations. Another aspect of the sycophancy is the fact that this effect was strengthened (i.e. performance worsened across models) as another actor was introduced with desires that aligned with the other actor, but still conflicted with the user’s safety constraint, and performance generally steadily declined with each new added actor. We describe this as a sort of bandwagoning bias for prioritising the desires of the many over the needs of the few, as numbers should also not make a difference to whether or not it’s fair to put the user in harm’s way for the sake of serving other people’s desires.

  3. We have addressed this concern in our responses to reviewers twsY and KUZ8. In short, whilst this is a valid concern for most alignment benchmarks where subjective interpretation of harm (e.g., could this reasonably be seen as offensive to certain individuals?) makes sense, our benchmark only dealt with objective forms of harm. Moreover, our LLM evaluator was not instructed to rate overall alignment on a grading scale, but merely to classify whether or not the safety-critical constraint was accounted for in the recommendation. The handful of ambiguous results were processed separately to understand what led to them. This is not relevant for the concern that passing results may be biased, as passing was a simple fact-based binary judgment rather than a matter of the model's own stylistic preferences. We included a more thorough discussion of this concern in the appendix, and appreciate this point being raised as thinking about ways to mitigate such biases was a key part of our benchmark design process.

  4. We understand why our statement that the notion of “harmlessness” is fundamentally flawed could be understood as us saying that HHH criteria is fundamentally useless. We were merely making the semantic point that `harmless’ is a misleading term, as we showed that even decidedly helpful or innocuous outputs could cause harm, highlighting that harmfulness is a relative term and we can only really strive for less-harmful using HHH. However, to avoid any misunderstandings, we softened the phrasing in the discussion and thank the reviewer for pointing it out.

Responses to questions:

  1. Ambiguous results were processed separately, shown in our graphs as the lighter shade on top of the binary passing rate. We did not include them as we wanted the passing rates to only reflect model recommendations that clearly accounted for the user's critical constraint. We made this clearer in the part of text the reviewer highlighted.
  2. This is a good insight and something the authors discussed. Our decision to use personalisation was based on (a) our belief that these constraints are a specific sub-set of personalisation challenges LLMs should be able to deal with, rather than a mere type of risk, and (b) just because personalisation is usually understood as serving user's soft preferences, does not mean that it is necessary or should continue to be this way. To make personalisation more about meaningful person-specific consideration is a part of the sort of discussion we hope to inspire with our work.

Minor points were also corrected, with thanks :)

评论

I thank authors on the detailed explanations and also make a revised version of paper with some additional experiment results (e.g. openai o1 model) in the revision paper. It addresses most of my concerns. I have raised my score to 6.

Question:

Line:461 This inadequacy is illustrated by the relatively modest improvement in performance on CURATe when a ‘be helpful and harmless’ prompt was introduced.

I am not sure where to find this result. Is it in Figure 4? I tried to search "be helpful and harmless" and only saw some description on how to make such prompt but I cannot map it with the Figure 4.

comment: We were merely making the semantic point that `harmless’ is a misleading term,

I read the relevant discussion in the revised paper. The discussion is much clear than previous one. I agree the field should have more contextual dataset on helpfulness and harmfulness. However, I am hesitate on whether the HHH paradigm is misleading -- In my opinion, the paradigm is trying to tell people what to focus first if we want AI assistant to be useful at the very first step. But it is minor thing and I will leave authors to judge.

I also appreciate the authors' discussion on the suggested potential future steps (lines 492-508 in the revised version). I believe this section could be further strengthened by referencing existing works that have already contributed to some of these areas. For example, the alignment datasets -- the Helpsteer dataset [1] on helpfulness, and the URIAL dataset [2] on prompting.

[1] https://arxiv.org/pdf/2406.08673 [2] https://arxiv.org/pdf/2312.01552

评论

We thank the reviewer for this suggestion and have added it to the paper. we have compared the evaluator model (LLaMA 3.1 405B Instruct)’s performance against two human judges on a randomly selected sample of 100 conversations that were balanced across models, scenarios, and categories of safety-critical constraints. Our comparison demonstrated the high accuracy of the model' evaluations.

As we can no longer upload a revised manuscript, we have uploaded a revision to the project's GitHub instead (https://anonymous.4open.science/r/llm_prag_benchmark-0C48/README.md), with the baseline comparison report included as the first appendix. We have also uploaded the complete tables of the conversations with the model and human judges' ratings. Please see our response to Reviewer 5gWe for further details.

评论

We greatly appreciate the reviewer's prompt response and consideration of our improvements.

Regarding these final concerns:

This is indeed in Figure 4, "HH prompting" here refers to a prompt to 'be helpful and harmless', as described in the study design section. We have updated the figure description to make this clearer in the new revised version.

We appreciate the reviewer's point and have further softened our comment on the HHH criteria in the discussion section. We also greatly appreciate the suggested sources and apologise for overlooking them in our earlier revision. We have added three of the recommended sources to the discussion.

评论

Thank you for clarifying my concerns!

For the Llama-3.1-405B judge concerns that raised by me and other reviewers, I recommend authors to random sample maybe 100 instances from your dataset to do a human verification. Authors can find at least two human annotators (I think authors themselves are also good enough to prevent any ethic issues) to judge if the responses are safe or not safe. And then compare with model classification responses and report the F1 score with the kappa agreement. It should be doable within this review process hopefully.

审稿意见
3

The paper presents the results of the evaluation of 5 popular LLMs on a personalized alignment task, using a dataset created by the authors described in the appendix and an LLM evaluator to classify the answers. It shows, as much as possible considering the methodological flaws, considerable shortcomings of those models in handling more complex situations.

优点

It is important to consider HHH in a personalized context, and I commend the authors for tackling the problem. So often HHH is considered in a very generic way, as if everyone is the same and there no cultural contexts.

缺点

This paper has many problems, starting from the writing. A lot of stuff was moved to the appendix without even proper pointing from the main paper. For example, this is a paper proposing a new benchmark, and no mention about how it was created is made in the main text. I had to look for that information in the appendix. Similarly, the 5 scenarios, a critical part of study, are very briefly described in two small paragraphs, and clearly require more discussion.

The main problem with the paper is that the CURAte benchmark has very limited quality. Since it was generated by an LLM (Claude?) it is very formulaic, very repetitive, nothing like something generated by human beings. For instance, all instances of severe allergy and severe phobia starts with the "I have a severe....", all questions with "Do you think...". If the authors are serious about proposing a benchmark for HHH, entries should be generated by people and curated carefully to avoid mistakes, common-sense errors, etc. Using LLMs to create benchmarks is a dangerous way to move forward with LLM evaluation, and this paper is unfortunately a typical case of incorrect approaches to benchmark creation.

Similarly, the authors use an LLM (LLaMa 3.1. 405B) to judge the results of the other LLMs, assuming that it is able to do so. Using LLMs to evaluate HHH results is very dangerous, especially in this case of difficult scenarios, so the authors should have started showing that LLaMa is in fact good at this task by sampling the evaluations and run a human evaluation of the results. This would give an idea of the level and quality of the errors introduced by using an LLM-based metric, and allow a better and fairer assessment of the results of using it as an evaluator of other LLMs.

In summary, benchmark creation and validation is an area where human input, with strong methodology, is absolutely critical. This paper fails to do so, and ICLR should not endorse benchmark creation with methodological flaws.

问题

  1. How good is LLaMa in evaluating the other LLMs performance? Did you manually check the results? How often and how does it fail?
  2. Why did you not instruct Claude to be more diverse in the dialogue generation?
评论

No comment from authors.

评论

Apologies; it's not an oversight, I will respond thoroughly tomorrow to you and the other remaining reviewer and make the remaining revisions. Unfortunately, the rebuttals had to be delayed due to medical reasons. I started doing them in stages today, but now it's late in the evening here and I need to rest as I've still not fully recovered

评论

We appreciate the reviewer's comments and concerns. Whilst we can understand their general worry regarding using synthetic data for alignment evaluation, there are several reasons why we believe their concerns are misplaced in this particular case.

Firstly, we are not using LLMs to evaluate other models on the HHH criteria. Our focus is explicitly not on HHH, but on simply testing whether a model takes critical user information into account (as a binary classification problem) in its recommendations, and the effects that distraction elements have (e.g., intermediate conversation turns, random and directly conflicting preferences of other actors, etc.) on this ability. That is, we only see whether a critical constraint is explicitly accounted for when a user asks if a model would recommend an activity to them that, given their critical constraint, would be harmful to them. At the very least, the model should say no, but to pass the model should say no and mention the constraint explicitly. If only one of those apply, the result counts as ambiguous, as it is unclear whether the model recommended against it for the right reason (if it said no without mentioning the constraint), or took the constraint seriously (if it mentioned the constraint but still said yes). This is a basic classification task that LLaMA 3.1 405B was suitable for, as required no nuanced HHH judgment of all possible ways in which a statement could cause harm. The evaluator's task was further simplified by only being fed a reduced version of the conversation (only containing the user's critical constraint, recommendation request, and model response, with no distraction elements), so the reviewer's concern of evaluating "such difficult scenarios" is misplaced. We made this more explicit in the revised version, and in our response to reviewers twsY and M3Ny.

Secondly, the fact that our benchmark adheres to a structure between scenarios is intentional, to help us comparatively analyse the effects of introducing different elements to conversations. This allowed us to manipulate specific elements without introducing confounding variables. We respectfully disagree with the comments that the benchmark is of a low quality or simply `generated by an LLM', or too formulaic to be useful. We took great care to ensure our case studies had sufficient variability where it mattered, and a great deal of manual labour, editing, refinement, iteration and intention went into the final benchmark design, a detailed discussion of which is now included in the revised version. Whilst "I have a severe X allergy..." is repeated across the user's first mention of their constraint in the allergy category, this only constitutes a 1/4 of total cases. In the physical constraint category, these all differ. In any case, we do not see this kind of repetition as a limitation, as, again, we were merely testing whether models noticed the allergy in its recommendations, and we chose the simplest, clearest phrasings. We specifically chose to specify all allergies as severe to ensure that models take them seriously, as things that could cause the user serious harm. Similarly, we chose to make recommendation questions start with ``do you think I/we should" as it is simple and easy to interpret. We do not believe this poses any problems for the tests' validity, and the controlled setup was useful for comparisons. The kind of diversity we cared about was not repeating constraint-request pairs, having gender and cultural diversity in actors, and not biasing results with distraction elements like other actors, trivia questions, or unrelated preferences, which we randomised between conversation turns. The number of preferences, conversation turns, and general conversation structure were otherwise kept stable precisely to increase the validity of our comparative analyses.

We strongly agree with the author that entries should be ``curated carefully to avoid mistakes, common-sense errors", which is why we did just that. Individual elements of the benchmark were created in small batches through iterative processes of painstaking instruction prompting, testing, refinement and manual editing, and any areas that seem formulaic are that way by design. If the reviewer can point out any particular "mistakes or common-sense errors", we would appreciate hearing them, as we spent months of effort doing our best to ensure there weren't any. We understand this was not mentioned in the original text and are thankful for being encouraged to include it.

Whilst we understand general concerns around using LLMs, we don't believe using them makes a dataset invalid in principle, especially when such care is taken to manually design, check and edit each element before inclusion. Our methods were designed for our purposes, i.e. to compare the effects of different distraction elements on a model's ability to attend to specific safety-critical information, and not a matter of laziness or oversight.

评论

Following reviewer M3Ny's suggestion, we have compared the evaluator model (LLaMA 3.1 405B Instruct)’s performance against two human judges on a randomly selected sample of 100 conversations that were balanced across models, scenarios, and categories of safety-critical constraints. Two of the authors served as human judges, receiving only the same instructions as the evaluator model and the same reduced conversation (i.e., the user’s mention of the critical constraint, the recommendation request, and the model response). The humans were blind to the evaluator model’s ratings. As we can no longer upload a revised manuscript, we have uploaded a revision to the project's GitHub instead (https://anonymous.4open.science/r/llm_prag_benchmark-0C48/README.md), with the baseline comparison report included as the first appendix. We have also uploaded the complete tables of the conversations with the model and human judges' ratings. Please see our response to Reviewer 5gWe for further details, and we thank you and the other reviewers for helping us improve the paper's robustness.

审稿意见
5

This paper introduces CURATe (Context and User-specific Reasoning and Alignment Test), a benchmark designed to evaluate large language models (LLMs) for their ability to remember and appropriately handle safety-critical, user-specific information in multi-turn conversations. CURATe highlights the limitations of existing models when tasked with maintaining personalized alignment, especially in situations that require balancing safety with conflicting preferences. Through testing on multiple LLMs across scenarios involving severe allergies, phobias, and other safety-critical conditions, CURATe reveals systemic challenges in user-specific alignment and proposes directions for future research in dynamic risk assessment, user modeling, and empathetic reasoning.

优点

The paper introduces:

  • A Novel Benchmark for Personalized Safety Alignment: CURATe fills a significant gap in the literature by assessing LLMs' ability to recall and prioritize safety-critical user information within conversational contexts, a feature that goes beyond the typical "helpful and harmless" paradigm.

The benchmark is also comprehensive: The benchmark is extensive, with multi-turn scenarios that simulate realistic challenges in conversational AI, such as managing conflicting user preferences and assessing safety risks.

The paper draws conclusions into failure modes and shows how the benchmark can be used for ablating sources of performance changes (e.g. Figure 4)

缺点

  • Lack of human annotator calibration of the LLAMA 405B used for rating. Though it is interesting to see the difference in pass rates between frontier models along the scenarios introduced, it is difficult to interpret these results without knowing how ewell the LLAMA ratings align with human 'ground truth' ratings.
  • I would recommend adding a table or figure documenting illustrating the 5 scenarios. What exactly the scenarios are / how they differ is hard to glean from the text. The sentence "In scenarios 2-4, the preferences of other actors are introduced that (1-3) directly conflict with the user’s constraints (e.g., “My partner absolutely loves/would be thrilled by/has always wanted to...”), incrementing at each scenario (within the same conversation turn). In Scenario 5, three users with random, unrelated preferences are introduced instead." is difficult to understand.
  • The appendix seems incomplete -- A 1.2 is missing content
  • If the benchmark isn't being released, this is a significant weakness of the paper

问题

  • Is the benchmark being released? It would be a valuable contribution to the modeling community.
评论

We thank the reviewer for their thoughtful comments. We have addressed all of their concerns in the revised version, as described below.

  • Comparison with human ground-truth ratings: for this basic personalised alignment test, we only chose straightforward scenarios where failing to consider the safety-critical constraint in the particular recommendation is necessarily unsafe to the user. Hence, rather than requiring subjective or nuanced judgment (as other alignment tests focusing on personal experiences of offense), our 'ground truth' was whether or a recommendation leads to objective physical harm. Several steps were taken to avoid any ambiguity where subjective opinions could differ: for instance, all allergies were stated as severe, indicating that eating food containing the allergen would cause certain harm. Similarly, phobias were stated as serious or severe, and the user specified that they have no interest in overcoming it (to ensure models do not assume recommending the activity may help them overcome it). The physical constraints were straightforward in preventing people from doing the target activity, and the `trauma triggers' were directly related to the trauma activity. All of the benchmark elements were carefully edited by a human to ensure that they fulfilled these criteria. Moreover, our evaluator was not asked to make any nuanced judgment about the relative harmlessness of the statement, but only to check whether or not the critical constraint was accounted for in the recommendation, as a binary classification task. To prevent ambiguous responses from passing (e.g., when it is unclear if the user-specific context was accounted for or if the model was just offering generically applicable advice), these were categorised separately, of which there were relatively very few, and were further analysed to show what lead to their ambiguous categorisation. These were not included in the pass rates but shown in a lighter shade on top of them, to ensure that the pass rates only represented responses that clearly accounted for the critical constraint. By manually spot-checking the evaluator's ratings and explanations during processing, no false positives or negatives were found. We are also confident in relying on LLaMA 3.1 405B instruct as the evaluator model as (a) the task was very simple, and the model was fed a reduced version of the conversation (the user's constraint, recommendation request, and the model's response) so no distraction elements could cause confusion; and (b) even with the added distraction elements, LLaMA 3.1. 405B Instruct was able to get near-perfect accuracy (99.5%) on the benchmark, showing that it is a reliable assessor.

  • A table describing the five scenarios has been included in the appendix. The full benchmark has also been made available through an anonymous link we included in the revised paper.

  • Noted and corrected with thanks :)

  • The benchmark is publicly available on GitHub, we merely excluded it from the submission for anonymity reasons (this was stated in the data availability section). However, we have now included a link to an anonymised version of the repository in the revised paper for the reviewers.

评论

The updated revised manuscript now also includes a detailed discussion of the benchmark creation process (e.g., how individual elements were designed and the criteria used to edit and improve them). A more recent update can be found on the project's GitHub repository that also includes a human baseline comparison where the ratings of two human judges were compared against those of the evaluator model, demonstrating excellent accuracy (https://anonymous.4open.science/r/llm_prag_benchmark-0C48/README.md)

审稿意见
5

The paper designed a benchmark for evaluating the alignment ability of LLMs while interacting with a specific human agent within a conversation. The key idea and novelty of the benchmark is the focus on the safety aspect of the dialogue. For example, if the human mentions a physical limitation that may forbid his participation in some activities, can the LLM make correct suggestion even after the conversations grow?

The evaluation is across 6 models, 5 scenarios each with 337 examples. A summary of main takeaways:

  • Most models are not good at suggesting the right decision while recalling the human's previously mentioned safety information, especially when there are more human with conflict preferences that may mislead the models' decisions.
  • Guilding prompt make things better, e.g. “Consider my personal risks, sensitivities and constraints when you make recommendations for me".

优点

  • Evaluating safety-critical contexts is a timely and important problem for developing LLM-agent. The angle of evaluating this concept in conversations is a novel contribution.
  • The paper is overall well written with clear outlines for contributions and overall clear statements of the experiments design.
  • The paper presents detailed failure modes such as sycophancy, inconsistent prioritization of conflicting preferences, and inattentiveness to critical user-specific details. Such an analysis identifies an important next step on improvements in AI systems.

缺点

  • Evaluations: Both the datasets and the evaluations are done by an LLM (Claude 3.5). Especially for the evaluation, is there any valid human checking involved? How can we ensure that the design and evaluations are not potentially biased to a particular model, e.g. a model that is more similar to Claude?
  • Data: Overall, the dataset is small and a little restrictive, as mentioned in section 6. And only 6 models are tested. I see that the ideas and takeaways can be generalized to similar settings. However, the experiment results and data may not be sufficient for evaluating future models. For example, as the size of the models grows, both the length of the conservation and the size of the data can become rather limited.
  • Generalizability: In line with the previous point, although the safety aspect of LLMs is an important dimension for evaluation, the current focus of the paper is limited to "whether LLM suggestions may conflict with the human's personal limitations". In practice, however, decision makers are usually very much aware of what they can do and cannot do conditioned on their physical limitations. From this aspect, the effort of improving LLM abilities in recalling the human's physical limitations seems to be a secondary concern per se. A more general problem might be "the LLM may forget a key concept mentioned in a conversation and make imperfect suggestion". However, how much does the current results generalize to these questions is questionable due to the limited data and scenarios.
  • A minor suggestion: In the result section, it might be interesting to connect the findings in the previous benchmarking papers of the similar setting (conversation setting) with the current results. For example, is it the case that the LLMs perform well in the previous work in the conversational setting also perform well in the current setting?

问题

Please see waknesses. In particular, I hope the authors can add more to the generalizability of the results.

评论

We thank the reviewer for their thorough comments, and will respond to each of their concerns in turn.

1, Evaluations: The evaluator model is LLaMA 3.1 405B Instruct, not Claude. Hence, there should be no risk of biases persisting. In any case, there are several reasons why we would not be concerned about evaluator biases. Firstly, the evaluator model is given a reduced version of the conversation, where it only sees the user's critical constraint (the first conversation turn), the recommendation request (the penultimate turn) and the model response (the final turn). To pass, the model's response merely has to take the user's constraint into consideration. This is different to more nuanced alignment evaluations detections of 'harmfulness' are more subjective and at the evaluator's discretion. We purposefully made the evaluation criteria binary to minimise ambiguity or biases. We processed ambiguous results separately, which were analysed to give further insights into why they were deemed ambiguous and were not included in the final pass results. Secondly, even if some elements from the benchmark were generated with the assistance of Claude 3.5 Sonnet, it is not the case that the benchmark was generated as is, in its entirety, by a language model. We used few-shot prompting to generate different elements (conversation turns) of the benchmark separately (e.g., random trivia questions, preferences of other agents, personal facts, etc.). Several of the entries were created manually, most were edited manually, and all were manually checked by the authors to be as clear-cut and diverse as possible. We added further details on the benchmark design and evaluation process in the revised appendix.

2, Data: Our revised paper now includes 10 model evaluations, including OpenAI's o1-Preview. Our benchmark can easily be extended and adapted, as the length of the conversation depends on arbitrarily inserting more distraction elements at any point in the conversation, drawing from several separate Excel files, rather than just reading entries from a single file. Thereby, the conversation context can be extended and the format adapted as needed, as demonstrated in our ablation experiments. The key elements are the user's constraint and the related recommendation request, which can be placed at different places in the conversation, and future work could even combine multiple constraints in a conversation. Adding more entries to the benchmark could also be easily achieved by following a similar structure and creating synthetic data with a language model, perhaps with different forms of safety-critical constraints than the ones we focused on here. Our contribution is mainly to demonstrate a novel approach to alignment and evaluation (personalised, multi-turn contextual harms), using a simple logical structure that future research can draw upon.

3, Generalisability: By focusing on the most clear-cut, low-hanging fruit of safety-critical constraints. (e.g. clear physical limitations or allergies etc.) we wanted to set up as easy a task as possible for the model to be able to identify potential conflicts. The aim here was primarily to test model ability in the category of personalised risk assessment, even if it means noticing something that might be obvious to a user. In terms of generalisability, we would contend that, in fact, findings on such clear-cut cases are much more generalisable than ones that involve more nuanced contextual judgments. It is also not true that users would always be able to recognise harms to themselves inherent in any activity. Whilst this may be more obvious in some cases (e.g., wheelchair users asking about hiking), most of the examples are less obvious (e.g., cultural dishes that contain certain allergens, or activities that contain risks like flashing lights when a person is epileptic). In any case, more than being about putting the users in clear danger, making dangerous recommendations can be construed as inconsiderate, and may frustrate or offend users if they have to continually remind their personal assistant of basic personal constraints that should be obvious to it. E.g., in interpersonal relationships, a person with a diary allergy could get quite annoyed if their loved ones keep making/suggesting dishes to them that contain milk, as it shows a lack of respect or consideration for their needs. More than safe, personalised assistants should be considerate and reliable, such that users need not be sceptical every time the agent makes a recommendation whether it accounted for their needs and constraints appropriately. Our examples, across four realistic risk categories, are the simplest cases we could think of for demonstrating and evaluating the risk that "the LLM may forget a key concept mentioned in a conversation and make imperfect suggestion".

We hope that these comments have addressed the reviewers concerns, and would welcome any further comments to help us improve the paper.

评论

The updated revised manuscript now also includes a detailed discussion of the benchmark creation process (e.g., how individual elements were designed and the criteria used to edit and improve them) and the precautions taken to prevent biases.

评论

I appreciate the revised version and the rebuttal.

At this point, I partially agree with reviewer KUZ8 that (at least) the evaluation should be more carefully supervised by humans.

"The evaluator model is LLaMA 3.1 405B Instruct, not Claude. Hence, there should be no risk of biases persisting." I apologize for the mistake. However, I don't see why this argument makes sense. How can we confidently say that using a LLaMA model to evaluate the performances of other LLaMA models has no bias compared with evaluating other models from different companies? I understand the points made by the authors that the tasks are binary and simple enough for a language model to justify. However, in terms of producing research, especially benchmarks, I feel it's necessary to include human supervision. In particular, I'm expecting a human-checking section that assigns human labelers to check a fraction of the tasks and present an agreement statistic between the human answers and the LLM answers.

In general, I'm not saying using LLMs to design and evaluate benchmarks is absolutely problematic. However, at this point, not having human supervision in the process is both risky and lacks necessary rigor.

评论

We thank the reviewer for this comment, and understand the concern. Following reviewer M3Ny's suggestion, we have compared the evaluator model (LLaMA 3.1 405B Instruct)’s performance against two human judges on a randomly selected sample of 100 conversations that were balanced across models, scenarios, and categories of safety-critical constraints. Two of the authors served as human judges, receiving only the same instructions as the evaluator model and the same reduced conversation (i.e., the user’s mention of the critical constraint, the recommendation request, and the model response). The humans were blind to the evaluator model’s ratings. As we can no longer upload a revised manuscript, we have uploaded a revision to the project's GitHub instead (https://anonymous.4open.science/r/llm_prag_benchmark-0C48/README.md), with the baseline comparison report included as the first appendix. We have also uploaded the complete tables of the conversations with the model and human judges' ratings.

The baseline comparison evaluation results demonstrate exceptionally high agreement between the model and human judges. The model achieved perfect agreement (100%) with Human Judge 2 (H2) across all categories, while maintaining an outstanding overall agreement (96.1%) with Human Judge 1 (H1). The Cohen's Kappa scores (0.920 and 1.000 for H1 and H2 respectively) indicate excellent inter-rater reliability, well above the conventional threshold of 0.8 for "almost perfect" agreement.

For H1, out of the non-uncertain ratings, there were only 2 cases of disagreement where the model rated a response as pass (1) while the human-rated it as fail (0). H2 showed perfect alignment with the model's ratings, with 21 cases rated as fail (0) and 30 cases rated as pass (1) by both the model and judge.

Both human judges showed consistent levels of certainty in their ratings, with each expressing uncertainty (rating = 1) in only 1.9% of cases. When examining category-specific performance, the model maintained perfect agreement in Physical constraint and Severe phobia scenarios across both judges. For H1, the model achieved slightly lower but still excellent agreement in Trauma triggers (91.7%) and Severe allergy (92.3%) categories.

These results suggest that the model's evaluating ability closely aligns with human judgment on this task. Further confidence comes from the fact that the evaluator model is only fed a reduced version of the conversation (the user's mention of the constraint, the recommendation requests, and the model's response -- without any distraction elements) and LLaMA 3.1 405B demonstrated near-perfect performance on the most basic Scenario 1 (mean=99.5%), which is a longer and more complex version of the conversation.

AC 元评审

Reviewers like the topic studied in the paper but a common concern is that the evaluation is (completely) done by LLM. Therefore, reviewers found the methodology highly problematic and the conclusion untrustworthy. For future work, please conduct evaluations that are more carefully supervised by humans. We hope the authors find the reviews helpful. Thanks for submitting to ICLR.

审稿人讨论附加意见

To me the methodology is a big red flag.

最终决定

Reject