PaperHub
Rating: 6.8/10 (Poster · 4 reviewers)
Individual ratings: 5, 8, 6, 8 (min 5, max 8, std 1.3)
Confidence: 4.3 · Correctness: 3.3 · Contribution: 2.8 · Presentation: 3.3
ICLR 2025

Eliciting Human Preferences with Language Models

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-03-01
TL;DR

learning personalized models by asking questions in language

Abstract

Keywords
question asking, preference elicitation, language models, evaluation, human studies

Reviews and Discussion

Review (Rating: 5)

Formulating prompts or identifying few-shot examples can be challenging. To address this, the paper introduces generative active task elicitation, which allows LMs to guide task specification. More specifically, based on Figure 1, this is about eliciting the preferences of individual users in the context of personalization. The core idea is to have the LM interact with users (by asking them questions) so that users provide a detailed explanation of the task they want the LM to perform. In three empirical settings (content recommendation, moral reasoning, and email validation), the paper finds this approach to work as well as or better than prompt formulation or example labelling. Users also report that it requires less effort than other approaches.

Strengths

  1. The topic is interesting: personalization using LLMs (and the best way to do it) is generally an under-investigated topic that is likely to grow in importance as LLMs become more integrated into daily life.
  2. The paper validates the proposed method in three distinct domains: content recommendation, moral reasoning, and email validation, which is important for establishing the generalizability of the method.
  3. The method shows statistically significant improvements over traditional methods in content recommendation and email verification, as shown in Figure 3.

Weaknesses

  1. The empirical experiments are done on very small datasets (consisting of 16 to 54 test cases) that have been manually curated by the authors. Although the authors do note that all 388 participants do every task, it's not clear to me that (a) the tests are valid and (b) the results will hold outside the examples provided by the authors. Is there a reason why external tasks (which have been previously validated), such as [1] or [2] for moral reasoning, were not used? The low number of tasks also increases the likelihood that findings are not statistically significant, as Figure 3 shows. The submission also does not include the actual datasets, beyond illustrative examples, making it difficult to assess the validity/generalizability of the data.
  2. It's not clear the email verification task is a good task here, since judging whether an email is valid is fairly straightforward. More specifically, why does it matter whether the model matches user judgements of correctness? An email either is or is not well-formed; modelling user knowledge of such a simple task is not useful in my opinion. On the other hand, content recommendation and moral reasoning are useful because people can hold different perspectives (based on their own preferences) and 'ground-truth' annotations are not available.
  3. While it's useful to understand the proportion of the task that each method gets correct conditioned on interaction time (through AUC), it would also be useful to understand the raw correct proportion. It's possible that user-written prompts take more time than users only having to respond to questions, so the current metric does not reveal the raw correct proportion.
  4. I think the claim in Line 487 that "While our paper demonstrates value mainly on personalized tasks, we note that this framework can be applied generally as a replacement for prompting or in-context learning" is not grounded in evidence presented in the paper and might mislead others. I would recommend removing this claim from the conclusion.

[1] https://arxiv.org/abs/2210.01478
[2] https://arxiv.org/abs/2110.07574

Questions

NA

Comment

We thank the reviewer for the detailed and helpful review! We are glad you found it "interesting" and well-validated. Below, we address the weaknesses you have raised:

Weaknesses

  1. Evaluation dataset sizes are small

We use small datasets to enable every participant to answer every question in a reasonable amount of time. As the reviewer pointed out, though we have few test samples, we ran the study across many participants to try to obtain an accurate, low-variance measure of the effectiveness of each method. Although we weren't able to establish statistical significance in moral reasoning, we were able to do so in the content recommendation and email domains.

Why not use existing moral reasoning tasks?

We thank the reviewer for the additional pointers to moral reasoning datasets; we will refer to them in our work. Unlike DELPHI [2], we were less concerned with large-scale evaluation of moral judgments across many different scenarios than with measuring people's varied subjective moral opinions about a single particular topic. Meanwhile, the RBQA benchmark [1] was created in a very similar way to our dataset: by thinking through various situations in which it would be appropriate/inappropriate to cut in line (vs. steal a loaf of bread); the main difference is that our scenarios are shorter than the RBQA vignettes (making it possible for participants to fully annotate the test set).

[1] https://arxiv.org/abs/2210.01478

[2] https://arxiv.org/abs/2110.07574

Release datasets?

See general response 1.

  2. Why the email verification task? Why do user judgments matter in this task when there is only one ground truth?

In Section 4.1, L301, we motivate the email verification task as a precursor for personalized teaching, which may require first eliciting a student's misconceptions before deciding what and how to teach them. Please let us know how this can be further clarified!

  3. Report final correct proportion rather than AUC. Possible that user-written prompts take more time than users only having to respond to questions.

Great suggestion! We added a graph of the final correct proportions after 5 minutes in Appendix C.3; please see the new PDF. Overall, we find that GATE methods generally still outperform baselines in each setting, though the improvement is not always statistically significant with this metric. This indicates that GATE methods converge to high performance more quickly than baseline methods, then stagnate after a point beyond which further interaction may not be worthwhile. Furthermore, we capped user-written prompts at 5 minutes, similar to our GATE elicitation methods. We modified L358 to clarify this.
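For concreteness, here is a minimal sketch of the difference between the two metrics discussed above: the area under the "Δp(correct) vs. time" curve versus the final correct proportion. The function name, the carry-forward of the last observed value, and the 5-minute default horizon are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def delta_p_correct_auc(times_min, delta_p_correct, horizon_min=5.0):
    """Area under the Δp(correct)-vs-time curve, normalized by the horizon.

    times_min:       interaction times (minutes) at which accuracy was measured
    delta_p_correct: improvement in p(correct) over a no-elicitation baseline
    """
    t = np.asarray(times_min, dtype=float)
    y = np.asarray(delta_p_correct, dtype=float)
    # Carry the last observed value out to the horizon (e.g. a 5-minute cap).
    if t[-1] < horizon_min:
        t = np.append(t, horizon_min)
        y = np.append(y, y[-1])
    # Trapezoidal rule, normalized so the result is a time-averaged Δp(correct).
    area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(t))
    return area / horizon_min

# The "final correct proportion" would instead just be y[-1] (the value reached
# at the end of the interaction), ignoring how quickly each method got there.
```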

  4. L487 claim unsubstantiated; recommend removing this claim from the conclusion.

Thanks! We have removed this claim from the conclusion of the paper.

Comment
  1. Evaluation dataset sizes are small --> The issue with small evaluation datasets (despite having multiple people do each test sample) is that performance per test sample is not IID. That is, by random chance, some of the 16 test cases in content recommendation may happen to do well under the GATE approach (better than baselines), and it's not clear to me that this effect will generalize.

  2. Why not use existing moral reasoning tasks? --> I find this explanation a little unsatisfactory. It's not clear to me that the RBQA vignettes are very long (typically 2-3 sentences, e.g. "Someone arrives whose flight leaves in 3 hours. Is it OK for that person to skip to the front of the line?", see Table 1 of that paper), so this might not be a good reason to use new test cases instead of externally validated ones. My main concern is that the authors are introducing a set of new test cases and didn't upload it together with this submission (although they have plans to release it later on). This means that there's no good way for the reviewers to know if these test cases are representative/reasonable beyond a handful of examples shared in the Appendix.

  3. Why the email verification task? --> I'm still unconvinced that the email verification task is meaningful. For context, email formats can be easily expressed in one regex. Participants who use email on a regular basis (likely most if not all participants) probably have a good understanding of what is an acceptable email address format. Because the main claim of this paper is to capture individual human preferences (see title and abstract), I think the email verification setup is not a good one to support this claim, since whether an email is well-formatted is preference-agnostic.

  4. Report final correct proportion rather than AUROC. --> Thanks, this is a good insight on why GATE is more useful than baselines (by reaching the goal earlier, within 5 minutes).

  5. L487 claim unsubstantiated, recommend removal this claim from the conclusion. --> Thanks!

For the reasons above, I would maintain my rating at 5.

Review (Rating: 8)

This paper tackles the task of learning about user preferences by proposing generative active task elicitation as an alternative to example-based supervised learning, pool-based active learning, and prompting. Through a pre-registered user study, the authors show that with their proposed method, LLMs are able to generate queries to interact with users and learn user preferences with comparable or better effectiveness than all baselines, while being equally or less mentally demanding than directly having the user write prompts to describe their preferences.

Strengths

This paper tackles a clear gap in the current literature: while we do have lots of methods to learn from preference data once collected, few works actually try to see how to best elicit preferences, and this work does just that. While the methodology is simple, this paper could serve as a good baseline for future work in generative task elicitation. It is nice that the paper conducts a pre-registered user study and considers many validation checks and ablations (e.g. can we use an LLM to simulate users in this setting? How much variation among users exists in preferences?). Also, I think it is overall very nicely executed and the presentation is very clear.

Weaknesses

Nothing major that could affect the paper but here are a few suggestions:

  1. It would be nice if you could refer to papers from other fields more (e.g. marketing or recommender systems) when making claims such as the one in Section 4.1 ("users might find it difficult to express tradeoffs…").

  2. I think that in order to justify why this paper is important, one additional thing to show would be how prevalent the queries that would clearly benefit from generative preference elicitation actually are "in the wild", say, in WildChat. To play devil's advocate, one could argue that a general set of alignment targets should be enough for the vast, vast majority of user queries. As it currently stands, the paper still feels a bit lacking in terms of ecological validity unless you show this: do these sorts of queries actually come up, and how often?

  3. One conceptual point that would perhaps be good to clarify: is there any difference between modelling user preferences (what they think they want) vs. serving them what is in their best interest?

Questions

  1. Would your human-study data be released? I think it would be helpful to others working in this direction.

  2. (Section 4.4) For the user-written prompt condition, do you also place the same time limit of 5 minutes?

  3. Could you mention the details of the statistical tests conducted?

Comment

Thank you for your detailed review! We are glad you found our paper nicely executed and well-presented. Below, we address the weaknesses and questions that you raised:

Weaknesses

  1. Refer to papers from other fields more (e.g. marketing or recommender system) when making some claims

Great suggestion! Could you provide some concrete references for us to add to the paper?

  2. How prevalent are the queries that would clearly benefit from generative preference elicitation for LLMs in the wild? Do these sorts of queries actually come up, and how often?

Great question! Some recent work [1] has shown that in LMSYS-Chat-1M, as many as ~30% of the chats include explicit feedback, with the plurality of feedback cases (~35%) corresponding to the user rephrasing their last request, presumably because the LM misinterpreted their initial query. Moreover, ~30% of feedback cases correspond to the user pointing out that the model was wrong (in ~18% of cases the user corrects the LM, and in ~12% of cases they simply say the LM was wrong), again likely due to LM misinterpretations of queries. We have added a discussion of this to L53 of the introduction.

Furthermore, we believe that the way users tend to use LLMs is also constrained by what they believe LLMs are good for. We aim to break out of the current paradigm by introducing elicitation as part of the LLM querying process. This may enable users to use LLMs for increasingly personalized applications in the future.

[1] https://arxiv.org/pdf/2407.10944

  3. Is there any difference between modeling user preferences (what they think they want) vs. serving them what is in their best interest?

This is a great point! Our experiments focus on whether we can directly predict users' expressed preferences. But inferring these preferences is also a necessary first step toward optimizing for other outcomes (e.g. user well-being) which would be an interesting direction for future work. We also included a brief discussion in our limitations section (L518) regarding ethical risks of aligning to every human preference.

Questions

  1. Would your human-study data be released? I think it would be helpful to others working in this direction.

See general response 1.

  2. For the user-written prompt condition, do you also place the same time limit of 5 minutes?

Yes, we do! We have changed L371 to clarify this, see the edits in red.

  3. Could you mention the details about the statistical tests conducted?

We use permutation tests (see L459) to assess whether the GATE setting is meaningfully different from each baseline setting: we shuffle the per-transcript area-under-"Δp(correct) vs. time" values (collected through either GATE or a baseline elicitation method) and check whether the values from the two settings can still be differentiated. Please let us know if you have further questions!
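For readers unfamiliar with the procedure, a minimal sketch of a two-sample permutation test of this kind (the function names, the difference-in-means statistic, and the permutation count are assumptions for illustration, not details taken from the paper):

```python
import numpy as np

def permutation_test(auc_gate, auc_baseline, n_perm=10_000, seed=0):
    """Two-sided permutation test on per-transcript AUC values."""
    rng = np.random.default_rng(seed)
    auc_gate = np.asarray(auc_gate, dtype=float)
    auc_baseline = np.asarray(auc_baseline, dtype=float)
    observed = auc_gate.mean() - auc_baseline.mean()

    pooled = np.concatenate([auc_gate, auc_baseline])
    n_gate = len(auc_gate)
    extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                   # shuffle group labels
        stat = perm[:n_gate].mean() - perm[n_gate:].mean()
        if abs(stat) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (n_perm + 1)                  # permutation p-value

# e.g. permutation_test(gate_aucs, baseline_aucs) < 0.05 would indicate a
# meaningful difference between the two elicitation settings.
```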

Comment

Thank you very much for your response. I just mean that it would strengthen your argument to refer to papers from areas that have dealt with human preferences in the past when you make claims like "users might find it difficult to express tradeoffs…". It is for you to decide which papers exactly to cite.

Since I am already giving a fairly high score, I maintain my score.

Review (Rating: 6)

This paper introduces an LM-human interactive approach to eliciting user preferences for complex tasks. The authors leverage the LM to learn human preferences and values through a conversation driven by guiding questions. They design three tasks to evaluate the proposed approach: content recommendation, moral reasoning, and email validation. Compared to supervised learning, in-context active learning, and user-written prompts, the proposed GATE framework generally outperforms the baseline methods. Based on the results, GATE can be regarded as a more accurate approach to eliciting human preferences.
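For readers new to the setup, a minimal sketch of the kind of elicitation loop described above. Everything here is a hypothetical illustration: `query_lm` is a placeholder for whatever chat-completion call is available, and the prompt wording and turn budget are not taken from the paper.

```python
def query_lm(prompt: str) -> str:
    """Placeholder for a chat-completion call; plug in an actual LM API here."""
    raise NotImplementedError

def elicit_preferences(task_description: str, n_turns: int = 5) -> str:
    """Build up a free-text transcript of the user's preferences by asking questions."""
    transcript = f"Task: {task_description}\n"
    for _ in range(n_turns):
        question = query_lm(
            transcript
            + "Ask the user one short question that would most reduce your "
              "uncertainty about their preferences for this task."
        )
        answer = input(f"{question}\n> ")  # the human answers in natural language
        transcript += f"Q: {question}\nA: {answer}\n"
    return transcript

def predict(transcript: str, test_case: str) -> str:
    """Use the elicited transcript as context when labelling a held-out example."""
    return query_lm(
        transcript
        + f"\nGiven these preferences, how would this user judge: {test_case}?"
    )
```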

Strengths

  • well-crafted paper and clear description of the proposed method
  • the three tasks are close to real-world scenarios, demonstrating the practicality of the proposed method
  • the main conclusion is well supported by the experimental results and analysis

Weaknesses

  • GPT-4 is the only model used in this paper, which makes it hard to assess the effectiveness of the proposed approach. Additional experiments with other proprietary or open-source models would make the results more convincing to readers.
  • The proposed approach is based solely on LM-driven elicitation for understanding and explaining human preferences. However, the paper lacks references verifying that LMs possess such capabilities. Some references I recommend: [1-3]
  • The second conclusion, at line 491, is odd given that GPT-4 is the only model tested with this approach. How many parameters would qualify as a "larger model"? Does this conclusion indicate that the model should be at least as powerful as GPT-4 to work with the proposed method?

References:
[1] Hu et al., 2023, DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
[2] Oh and Kim et al., 2024, Uncovering Factor Level Preferences to Improve Human-Model Alignment
[3] Hejna et al., 2023, Contrastive Preference Learning: Learning from Human Feedback without RL

Questions

Please see the weaknesses above.

Comment

Thank you for your review! We are glad you found our paper well-crafted, practical, and well-supported. Below we address some weaknesses that you have raised:

Weaknesses

  1. GPT-4 is the only model used

To clarify, we also ran experiments with Mixtral, an open-source model; please see Appendix C.3. We refer to this section in the main paper at L465. Please let us know how we can make this clearer in future versions of the paper!

  2. Missing references

Thanks for the pointers! We have added these references to the introduction of our paper (L113).

  3. Conclusion on L491 is unsupported since only GPT-4 was tested

We believe this finding is supported by both our GPT-4 results and the Mixtral results mentioned above. More importantly, we mainly intended this line to point toward a possible direction for future work; we will clarify this in future versions!

Comment

Thanks for the clarification. I maintain my current score since the experimental results are insufficient to support all conclusions. This is still meaningful work that presents a dynamic framework for gathering more diverse representations of human preferences. Its publication can provide valuable insights for the community.

Review (Rating: 8)

The paper contributes the GATE framework, an alternative to traditional methods of gathering human preferences for purposes including alignment, which is used to obtain more flexible and active representations of people. The paper first makes a theoretical contribution highlighting the framework's various implementations and its differences from previous methods, before diving into a set of example cases where instances of the GATE framework are found to perform better than various baselines in terms of elicitation efficiency and perceived human effort.

Strengths

Originality: One of the biggest strengths of the paper is its proposal of an alternate method of thinking about how to gather preferences. I think this framework challenges existing assumptions surrounding preference collection that can influence future collection of datasets surrounding human preferences and related areas.

Quality: Ablation studies are very rigorously done. I especially appreciated the ablation on influencing people's opinions, which I would not have thought of but which is definitely worth including. Evaluations are in general very rigorously conducted.

Clarity: The paper is extremely well written and framed with respect to past work. It clearly highlights the key differences between GATE and past categories of methods, making the contribution very clear. Evaluations are also well-motivated.

Significance: Preference learning of people is an extremely broadly applicable domain, and this work contributes towards our understanding of a more holistic view of preference collection.

Weaknesses

Originality: I think the categories of GATE tested (examples, yes/no, open-ended writing) being fixed is a missed opportunity; ideally the LLM would have an active decision process that chooses between these three (or more) options in order to best address existing issues. For example, edge-cases are probably best addressed by examples, whereas a missing dimension would be better answered using open-ended writing.

Quality: The LLMs used are somewhat out of date. Asking the LLM to output probabilities seems slightly sketchy to me, but even so there are already reasonable results, so it serves as a proof of concept.

Clarity: Lines 159-160: A(•, •) denotes a measure of alignment; it could be clearer that A is like a distance metric, where lower values are better. Line 173: "above" -> "to the left".

Significance: One pessimistic way to view this paper is that it does some structured multi-turn dialogue and gets more efficient performance. In particular, from a purely materialistic point of view, it could be seen as similar to iterative prompting with refinement, and from this view it's not a very novel paper. However, I think this view discounts a fundamental shift in paradigms that is missed by previous work: allowing for flexibility and active querying. This is not a methods paper but a framework/paradigm paper, which I think is the right framing. Thus I think the paper isn't really subject to these criticisms.

Questions

Have you tested asking the LLM to choose what type of GATE elicitation to do?

Have you tried combining this with frameworks of pragmatic inference or concepts from Jara-Ettinger's work on inference? There might be some interesting intersections there.

Comment

Thank you for your careful and detailed review! We are glad you found our paper original, high quality, and significant! Below we respond to some of the weaknesses you have raised:

Weaknesses

Originality

Mix categories of questions?

This is a great point. In early experiments, we indeed had an additional "free-form conversation" category that allowed the user to freely interact with the language model as they wished, with no constraint on the type of questions. This setting allowed the LM to dynamically decide what type of question it wanted to ask. However, we found that this freedom actually caused the LM to display some ill-formed behavior, e.g. devolving into repetition or incoherence, or getting sidetracked by the most recent user response and losing track of the main task (at least with the GPT-4 models we were using at the time). Restricting the format to a particular type of question kept this behavior more at bay. However, we agree with the reviewer that, in principle, with stronger LMs, it could be more effective for the LM to actively decide what type of question to ask. We have added these findings to a footnote on page 5.

Quality

Use more up-to-date LLMs?

Thanks for this suggestion! While we do not have the time or budget to run a full human experiment with updated models, we use GPT-4o to evaluate transcripts, and added the results to Appendix C.5. We see similar results as using GPT-4 as the evaluator in content recommendation, and noisier results in email validation. Qualitative error analysis finds that GPT-4o tends to be more sensitive to noise in transcripts.

Why ask the LLM to output probabilities?

Good question. We tried asking the LLM to directly output predictions in early versions of the experiments, but quickly found that LLMs are not very responsive to adjusting predictions based on prior context: as noted in L412-415, LMs sometimes had a tendency to predict "yes" or "no" for the entire test set, perhaps due to miscalibration of the LM's implicit decision threshold. Having LMs output uncertainties made their predictions much more sensitive to prior context and gave empirically better results. Furthermore, some prior work has shown that LMs are relatively well calibrated at predicting uncertainties (at least on factual tasks [1]).

[1] https://arxiv.org/abs/2207.05221
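As a concrete illustration of the difference between eliciting a hard yes/no label and eliciting a probability, a minimal decision-rule sketch (the parsing logic and the 0.5 default threshold are illustrative assumptions, not details taken from the paper):

```python
import re

def parse_probability(lm_output: str) -> float:
    """Extract the first number in [0, 1] from the LM's free-text answer."""
    match = re.search(r"\b[01](?:\.\d+)?\b", lm_output)
    if match is None:
        raise ValueError(f"no probability found in: {lm_output!r}")
    return min(max(float(match.group(0)), 0.0), 1.0)

def predict_yes(lm_output: str, threshold: float = 0.5) -> bool:
    """Turn an elicited probability into a yes/no prediction.

    Unlike asking the LM for a hard "yes"/"no", an elicited probability
    exposes the decision threshold, which can be tuned if the LM's implicit
    threshold is miscalibrated.
    """
    return parse_probability(lm_output) >= threshold

# predict_yes("p(user likes this article) = 0.8")  -> True
# predict_yes("I'd estimate about 0.35")           -> False
```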

Clarity

Thanks for the suggestion! We have edited L163-164 and L178 in response to your feedback. Please look at the newest version of the paper; edits are highlighted in red.

Significance

We agree! The main contribution of this paper is not methodological, but rather a new paradigm for querying LLMs.

Questions

Have you tested asking the LLM to choose what type of GATE elicitation to do?

Please see response to the "originality" weakness above.

Have you tried combining this with frameworks of pragmatic inference or concepts from Jara-Ettinger's work on inference?

Thanks for the suggestion! Can you please provide some pointers to concrete work? We will be happy to add them to the paper!

Comment

Thanks for the rebuttal! All my concerns are addressed, although it would be good to see replications in the future (maybe in future work?) using newer models. It would be exciting to see a model successfully meta-reason about the type of elicitation to use.

I retain my score and my positive opinion of the paper.

Comment

We thank all reviewers for their detailed and helpful reviews! Below, we address a concern shared by at least two reviewers:

Release data? [wu3x, FxWb]

Thanks for raising this! Due to ethical privacy considerations, we cannot release the raw text of the transcripts themselves, but we do plan to release (1) the test data, (2) the annotation web interface, (3) the instructions provided to models, and (4) code infrastructure to replicate all of our experiments.

AC Meta-Review

Overall this was seen as a good paper and an important contribution to the community. The paper is clear and well written. It studies three tasks that are close to real-world tasks. The key benefit this paper provides is a method to actively elicit user preferences, rather than simply studying user preferences that have already been expressed by the user. “It is nice that the paper conducts a pre-registered user study and considers many validation checks and ablations.” One reviewer questions the email verification task.

The weaknesses of the paper include one reviewer's view that "experimental results are insufficient to support all conclusions" and that only a few models were tested (GPT-4 is used in the main paper and Mixtral is used for evaluations in the appendix).

Additional Comments on Reviewer Discussion

All reviewers responded to the authors' rebuttal but none were moved to change their assessment, which remains, as originally scored, 8, 8, 6, and 5.

Reviewers were confident in their assessments. Positive reviewers noted weaknesses and negative reviewers noted strengths.

Final Decision

Accept (Poster)