PaperHub
Overall rating: 6.3/10 · Poster · 4 reviewers
Individual ratings: 4, 7, 7, 7 (min 4, max 7, std dev 1.3)
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

Aligning LLM Agents by Learning Latent Preference from User Edits

Submitted: 2024-05-16 · Updated: 2024-11-06

Abstract

Keywords
NLP, LLM, preference learning, user feedback, user edits

Reviews and Discussion

Official Review
Rating: 4

In this paper, the author formulates a new task in which the user may edit the agent's response to make its later responses more personalized. The author proposes a method called PRELUDE to learn preference descriptions of users from the users' previously edited contexts in an interactive framework. The author also proposes the CIPHER method to consolidate induced preferences with retrieval from users' historical edits. To verify the effectiveness of the proposed method, the author conducts experiments in two practical environments.

Strengths

  1. The paper proposes a new task of user editing during the interaction process between the user and the agent.
  2. The methods are intuitive and easy to follow.
  3. Many experiments are conducted to verify the effectiveness of the proposed methods.

Weaknesses

  1. In practice, will someone take a lot of time to give feedback to the agent? It seems that users may refuse to provide revised responses for agents to get personalized responses. Maybe there can be some ways to obtain users' preferences in an implicit manner.
  2. Can the cost function with Levenshtein edit distance correctly reflect the gap in user preference? It seems the user preference belongs to the semantic space, while the Levenshtein edit distance just compares differences in token sequences.
  3. The embedding methods used to convert historical text into vector representations are not fine-tuned; they are pre-trained language models with fewer parameters, and there could be a critical problem in terms of retrieval gaps. I think the most significant factor that determines which piece should be retrieved is the user preference. However, it seems that most pre-trained language models mainly embed and index texts according to their content, rather than styles such as user preferences. Therefore, I think there may exist a large gap in the retrieval process.
  4. The user simulator that is implemented by GPT-4 should also be evaluated to verify it can align with humans in the real world.
  5. The accuracy metric requires a classification model, but the author does not mention it in the paper.
  6. The author does not provide open access to the source code and data, but the author chooses "YES" in checklist 5.

Questions

See "Weakness" above. If the author could address my concerns, I'm willing to improve my rating.

Limitations

The author provides an analysis of the "Broad Impact Statement" in Appendix D.

Author Response

We thank the reviewer for their useful feedback. Please find our responses below in the order of the review. We start with an important misunderstanding.

In practice, will someone take a lot of time to give feedback to the agent? ...

Major clarification on the naturalness of user edits: The main motivation for using user edits is that they are naturally generated, as users make edits for their own needs and not because we ask them or pay them to do so. This is different from comparison-based preferences, which are typically explicitly elicited from annotators through crowd-sourcing.

To understand this, let’s imagine a writing application that has an LLM agent as a writing assistant. At any given time, a user may query the LLM agent to do something, e.g., “summarize this essay”, say for tweeting about the essay. The LLM agent has access to everything that is on the application window (e.g., the essay) and can use it to generate a response. Once the LLM agent generates a response, the user may find that it is close to what they want but not quite right. So, they might make edits to it, and they do so because they need the perfect response for their own goal, not because we ask them or because they care about improving the application. As they make these edits in the writing application, the LLM agent has free access to them, and we can use them to improve the LLM agent.

We mentioned this in the introduction in Lines 30-31 (“a natural feedback….user edits…for their own final use”). In fact, our setting has an important feature that “every natural use of the agent yields an edit feedback for learning”, leading to never-ending learning. We state this in Line 51. We will emphasize this more in the revision.

Another useful thing about our setting is that we frame this as an online learning problem where there is no separate train or test setting, and the agent is evaluated on every query. This makes it directly applicable to the writing assistant example above where performance on each query matters.

“Can the cost function with Levenshtein edit distance correctly reflect the gap in user preference?”

Purpose of edit distance: We want to minimize the user’s effort that arises due to edits they perform. They do these edits because the LLM response isn’t ideal. But after some time, we want them to perform fewer edits and ideally no edits. We use Levenshtein edit distance to measure the “cost of user effort”. If the edit distance is high, then the user ends up doing more work and so the agent should be penalized more. Therefore, edit distance serves as our primary metric of evaluation as it directly evaluates user effort.

Now with the PRELUDE framework, we hypothesize that edits happen because of latent user preferences that are not expressed in the user prompt, and which if we knew about we could minimize the edits. Thus, learning the right preference is not the end goal but is a way to minimize the user effort as reflected in edit distance. That said, learned preferences have the added advantage of also providing interpretability.

Our experiments show that our hypothesis is indeed true as the Oracle baseline that uses the true preference achieves minimal edit distance in Table 2. However, it is possible that there are cases where the learned preference is close to the true preference but where edits are higher, or vice versa. We are happy to add a correlation plot in the revision between “accuracy” which measures the preference accuracy and “edit distance” across all 200 rounds to visualize this.
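To make the cost metric concrete, the following is a minimal sketch of a token-level Levenshtein distance used as a user-effort cost. The whitespace tokenization and helper names are illustrative assumptions, not necessarily the paper's exact implementation.

```python
# Minimal sketch: Levenshtein edit distance over token sequences as a proxy for
# user effort. Whitespace tokenization is an assumption made for illustration.

def levenshtein(a: list, b: list) -> int:
    """Number of token insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute tok_a -> tok_b
        prev = curr
    return prev[-1]

def user_effort_cost(agent_response: str, user_edited: str) -> int:
    """Higher cost means the user had to do more work to fix the response."""
    return levenshtein(agent_response.split(), user_edited.split())

print(user_effort_cost("the cat sat on a mat", "the cat sat on the mat"))  # 1
```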

The embedding methods .... do not have a fine-tuning process....

Fine-tuning the embedding method: Our main algorithm, CIPHER, is compatible with several complementary advances happening in the field of LLM prompting. For one, we can directly use a better retrieval method instead of the cosine similarity with BERT/MPNET embeddings that we use. This can be done by simply replacing Line 3 in Algorithm 1 with a different retrieval mechanism. Our main goal in this paper is not to exhaustively evaluate all these approaches but instead to initiate the line of work on improving LLMs with user edits, and to do so by learning latent preference descriptions. We believe improving retrieval using learned embedding methods or using more complex language agents are all interesting future work directions. Specifically, we believe approaches such as Mysore et al., 2023 (cited in the paper) that learn the retrieval function can be relevant in our case.

"The user simulator .... it can align with humans in the real world.”

Alignment with real users: We provide indirect evidence of this alignment in the pilot study in Section 4.4, which shows that the ordering of models in Table 2 that learn from the GPT-4 user is the same when evaluated by human win-rate. A large-scale study with real users can answer this more conclusively and is a natural future work direction.

“The accuracy metric ... does not mention it in the paper.”

Accuracy metric: The accuracy metric is defined on Lines 187-194. We use the BERTScore function to compute it. We will revise the text to clarify that the metric “accuracy” in Table 2 refers to this particular metric, and we apologize for the confusion.

The author does not provide open access to the source code and data, but the author chooses "YES" in checklist 5.

Code: We apologize for the error in the checklist. We will fix this along with any other errors. We will release the code with the camera-ready version and are happy to provide a copy of the code at this dropbox link.

We have already linked the existing datasets we use in Table 4 in Appendix.

Comment

Thanks for the rebuttal by the authors. I would like to raise my score to 4, and agree to accept this paper if most of us tend to.

Official Review
Rating: 7

This paper investigates interactive LLM alignment by analyzing user edits to an agent's responses. The proposed framework, PRELUDE, enhances the agent's alignment with user preferences without extensive fine-tuning, thus avoiding high costs and potential performance degradation on other tasks. PRELUDE infers user preferences from historical edits and uses them to guide response generation. The study introduces CIPHER, an algorithm that uses an LLM to predict user preferences based on context and past edits. Tested in summarization and email writing scenarios, CIPHER demonstrates strong performance compared to other methods by reducing edit distances and computational costs, while maintaining user preference alignment.

Strengths

  1. Study an interesting and important problem
  2. The proposed framework is overall intuitive and reasonable
  3. The paper presentation is clear and very accessible

Weaknesses

No significant weaknesses but some method design choices can be further elaborated (better motivated). See below questions for more details.

Questions

  1. What is (and how to decide) the granularity of each edit? Take the example in Figure 1 for instance: there are lots of edits on $y_t$ to obtain $y_t'$. Are you merging all edits together? If yes, then the value $T$ in Protocol 1 is simply 1? But this seems to contradict the statement in line 165 where $T=200$. So are you treating each single modification as one edit? More discussion on this part is needed.

  2. What is included in the context $x_t$ when $t>1$ in Protocol 2 and Algorithm 1?

  3. What is the "Accuracy" metric in Table 2? It seems this metric is not included in Section 4.1?

Limitations

The author does provide some failure case analysis in the Appendix.

Author Response

We thank the reviewer for their useful feedback. Please find our responses below in the order of the review.

What is (and how to decide) the granularity of each edit? .... So are you treating each single modification as one edit? More discussion on this part is needed.

Clarification on edits: Each round represents a single query from the user to the LLM agent (e.g., the writing assistant). Figure 1 shows a single round in detail. The LLM agent is given a context and has to generate a response. In Figure 1, this context is "Article: {user-provided article} Please summarize the above article.", where {user-provided article} is replaced by the given article. The user then makes edits to this summary as shown in Figure 1. The edits here represent all token deletions, additions, substitutions, etc. that the user makes to the given response. In the next round, the user will very likely ask the LLM agent about a different article and may have a different request. In our experiments, we evaluate the agent on 200 rounds where each round contains a different context. Specifically, for the email writing example, there is a request to summarize 200 separate documents, one document per round. Importantly, note that different rounds are not different edits to the same document!

Note that our setup is an online learning setup where the agent is evaluated on each round and there is no separate train/test split. This is because, in real-world personalization tasks, there is no separate train and test phase.
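For illustration, here is a rough sketch of the online setting described above; `agent`, `simulate_user_edit`, and `edit_distance` are hypothetical stand-ins and not the paper's exact interfaces (Protocols 1 and 2 in the paper define the setting precisely).

```python
# Rough sketch of the online setting: one user query per round, the user edits
# the response for their own use, and the cumulative edit cost is the quantity
# to minimize. `agent`, `simulate_user_edit`, and `edit_distance` are
# hypothetical stand-ins, not the paper's exact interfaces.

def run_online(contexts, agent, simulate_user_edit, edit_distance):
    history = []            # (context, response, edited response) per round
    cumulative_cost = 0
    for x_t in contexts:                            # T = len(contexts) rounds
        y_t = agent.respond(x_t, history)           # response given past edits
        y_t_edited = simulate_user_edit(x_t, y_t)   # user fixes it for their own goal
        cumulative_cost += edit_distance(y_t, y_t_edited)
        history.append((x_t, y_t, y_t_edited))      # every use yields feedback
        agent.update(x_t, y_t, y_t_edited)          # e.g., infer/refine preferences
    return cumulative_cost
```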

What is included in the context xtx_t when t>1t>1 in Protocol 2 and Algorithm 1?

Content of the context: A context includes everything that is given to the LLM agent as input. Suppose we are in a writing application and there is a field where the user can write a query to an LLM agent. Then the context will include the user query along with the content on the screen of the application. The context may include other things, such as whatever the application knows about the user (e.g., their calendar). Figure 1 illustrates this setting, where the context is "Article: {user-provided article} Please summarize the above article.", where {user-provided article} is replaced by the given article.

What is the "Accuracy" metric in Table 2? It seems this metric is not included in Section 4.1?

Accuracy metric. The accuracy metric measures the accuracy of the inferred preference. This metric is defined on Lines 187-194, but we didn't specify the name “accuracy” in the text that we used in Table 2. We apologize for this confusion and we will state it clearly in the revision.

In summary, each context contains a document that comes from one of the document sources in $\mathcal{S}$. Each document source has a unique user preference. This is designed to capture the real-world phenomenon where users have different preferences depending upon what they are doing (e.g., writing an email to a friend or writing reviews for a conference). We evaluate a 0-1 score depending upon whether the BERTScore similarity of an inferred preference for document source $d \in \mathcal{S}$ is closer to the true preference of $d$ than to the true preference of any other document source $d' \ne d$.

Comment

Thanks for the rebuttal which addresses most of my questions. This is to confirm that I have read the author rebuttal.

Official Review
Rating: 7

The paper discusses interactive learning of LLM-based language agents based on user edits on the agent’s output. It first proposes the framework that infers a description of the user’s latent preferences based on historical edit data, and then uses an LLM to infer these user preferences. The proposed solution is tested on two tasks involving interactive environments: summarization and email writing.

Strengths

The paper is well-written, and the proposed algorithm/framework is well-illustrated. It also discusses an important topic: the interactive learning of LLM-based agents. Additionally, the experiments are comprehensive, including qualitative analysis of the learned preferences, human evaluation, and failure case analysis.

Weaknesses

The paper leverages an LLM to learn user preferences, which is sometimes not very reliable. Additionally, the retrieved historical examples might not be relevant to the given context. I believe the authors have recognized these drawbacks and mentioned them in the limitations section.

Questions

What's your future plan to enhance the quality of the learned preference?

Limitations

The authors have adequately addressed the limitations

Author Response

We thank the reviewer for their useful feedback. Please find our responses below in the order of the review.

There are complementary improvements happening in the general field of LLM prompting, inference, and planning. Some of the future work directions in which we can incorporate these ideas to improve CIPHER are as follows:

  • CIPHER can be used with any language agent, which might itself be a complex system (e.g., using multiple LLMs or a separate memory). In this paper, we used a simple language agent that does greedy decoding of an LLM given a prompt. Going ahead, experimenting with more complex agents, especially those focused on planning and reasoning, is a useful direction and can help further improve the learning.

  • Another direction is to improve the quality of preference as suggested by the review. One way to do this is to learn a retrieval model instead of relying on static BERT and MPNet embeddings. Existing approaches that learn a retrieval model can be relevant for this [e.g., Mysore et al., 2023].

  • Another direction is that, instead of using cosine similarity, we can learn a dense model that takes both contexts as input and predicts a similarity score. These dense models can better capture similarity, although they are more expensive to use. A hybrid approach can also be useful to balance these trade-offs, where we use cosine similarity to narrow down the choices and then use a dense model over these choices (a rough sketch of this hybrid idea is given below). These retrieval methods can be directly incorporated into Algorithm 1 by simply replacing Line 3 with these methods.
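As a rough illustration of the hybrid idea above (not the paper's implementation), the sketch below uses a cheap cosine-similarity filter followed by a more expensive pairwise scorer; `embed` and `dense_score` are hypothetical placeholders for whichever models would be used.

```python
# Sketch of the hybrid retrieval idea: a cheap cosine-similarity filter narrows
# the candidate past rounds, then a more expensive pairwise scorer re-ranks the
# shortlist. `embed` and `dense_score` are hypothetical placeholders.

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def hybrid_retrieve(query_ctx, history, embed, dense_score, narrow_k=20, final_k=5):
    """history: list of (past_context, inferred_preference) from earlier rounds."""
    q = embed(query_ctx)
    # Stage 1: cheap filter by embedding cosine similarity.
    shortlist = sorted(history, key=lambda h: cosine(q, embed(h[0])),
                       reverse=True)[:narrow_k]
    # Stage 2: expensive pairwise scorer applied only to the shortlist.
    reranked = sorted(shortlist, key=lambda h: dense_score(query_ctx, h[0]),
                      reverse=True)
    return [pref for _, pref in reranked[:final_k]]
```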


Official Review
Rating: 7

The paper first proposes PRELUDE, session-level personalization for a writing assistance task. Here a model must learn natural language user preferences from edits made by the user to outputs generated by a model during a session. Next, the paper proposes an algorithm, CIPHER, which leverages an LLM to infer natural language preferences given edits from a (simulated GPT-4) user. The approach is evaluated in two writing tasks, summarization and email writing, compared to a host of reasonable baselines, and evaluated for its ability to reduce cumulative edit costs incurred by the user in a session and for CIPHER's ability to recognize the latent preference.

Strengths

  • The proposed edit-based natural language preference learning task is novel and represents a meaningful contribution to the field. While the proposed approach, CIPHER, does not make any significant technical advances, it represents a reasonable first approach to the task and is worth exploring at the outset.
  • The proposed task also represents a natural setting in writing assistance making it a meaningful problem to explore.
  • The paper is well written for the most part.
  • The evaluation setup in the paper is clever for evaluating the proposed task and could be useful to future work in this space.

Weaknesses

  • Some aspects of the proposed experimental setup seem quite simplistic or unrealistic for the proposed task: 1) It appears that the task of identifying "user preference" is mainly a reframing of task identification, and the number of tasks in the two domains explored is small (5 summarization tasks, 4 email writing tasks). In realistic personalization scenarios I would expect a greater number of preferences. However, this drawback is likely a necessary part of initial work in this space and not a reason for rejection IMO. 2) The number of rounds in a session appears to be very high, T = 200. I understand this to be the number of times a user makes an edit to an output; it's unclear if this is realistic or if one should expect CIPHER predictions for smaller T to be significantly degraded.

Questions

  • Line 131, "We use cosine similarity for computing proximity..": Are the context and preference concatenated to obtain an embedding for computing cosine similarity?
  • Please consider citing work on instruction induction (https://arxiv.org/abs/2205.10782 and follow ups) and discuss its differences to the proposed task and approach.
  • Do I understand correctly that there are 5 summarization (sub-)tasks and 4 email writing (sub-)tasks, and that this could be used to compute the accuracy of a random predictor for the preference?
  • Table 2 or Fig 2: It would be meaningful to know the value of T at which maximum accuracy was reached for task/preference identification. Please consider reporting this in Table 2 or reporting the accuracy dynamics plotted against T.
  • ICL-edit descriptions: What does the notation $y_{z_t}$ refer to? Over what corpus is this retrieval performed? Is it reasonable to think that ICL-edit would have higher performance at lower T and also need to make fewer calls to the LLM? If yes, please consider discussing this tradeoff; this would not be a slight on CIPHER.
  • Thinking of the task: Is it reasonable to think that some instances see a much higher edit distance than others? In these cases it would be easier for the user to reword the prompt to generate a wholly new output rather than edit the output, or an edit would amount to a complete re-write. This was noted in prior work on editing model outputs for speech transcription: https://www.cs.cmu.edu/~jbigham/pubs/pdfs/2016/asr-threshold.pdf. Did your exploration/analysis of model results spot such instances? I wonder if there is scope in future work for only inferring preferences from edits in some sessions and suggesting a prompt re-write in others. It would be meaningful to add a brief discussion of this to the paper.

Limitations

NA

Author Response

We thank the reviewer for their useful feedback. Please find our responses below in the order of the review.

The number of rounds in a session appears to be very high...

Evaluation at different T: Since our setup is an online learning setting, we can evaluate the method and baselines at any value of $T$ up to the maximum we tried. This is because there are no separate train and test steps. To do this, please look at Figure 2, choose any value of $T$ on the x-axis up to the maximum we tried (which is 200), and look at the cost values at that point. Note that we don't see gains with just a handful of examples, since the preference varies across contexts, and the agent needs to see enough variety and learn the underlying preference before it can do well.

.... Are the context and preference concated to obtain an embedding for computing cosine similarity?

Cosine similarity computation: The cosine similarity is computed only on context embeddings. Given the context in the current round, we compute its embedding and compare its cosine similarity against all context embeddings in the previous rounds. We then find the $k$ closest previous context embeddings and retrieve the inferred preferences associated with them. The idea here is that we want to use the preferences of similar contexts, and not necessarily just find similar past preferences.

In other words, the cosine similarity is computed between two context embeddings: the embedding of the context at the current round, and the embedding of a context from a past round (i.e., a context in the history). This similarity computation retrieves the history entries most relevant to the given context.
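A minimal sketch of this retrieval step (roughly the role of Line 3 in Algorithm 1), assuming a generic `embed` function standing in for the BERT/MPNet sentence embeddings:

```python
# Minimal sketch of the retrieval step described above: embed the current
# context, find the k most similar past contexts, and return the preferences
# inferred in those rounds. `embed` stands in for a sentence-embedding model.

import numpy as np

def retrieve_preferences(current_context, history, embed, k=5):
    """history: list of (past_context, inferred_preference) from earlier rounds."""
    if not history:
        return []
    q = embed(current_context)
    past = np.stack([embed(ctx) for ctx, _ in history])
    sims = past @ q / (np.linalg.norm(past, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]              # indices of the k closest contexts
    return [history[i][1] for i in top]      # their inferred preferences
```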

Please consider citing work on instruction induction...

Instruction Induction: We will be happy to cite and discuss the instruction induction paper and the follow ups.

this could be used to compute accuracy of a random predictor for the preference?....

Clarification on sub-tasks and evaluating preference prediction: Both summarization and email writing have documents from different sources (5 for summarization and 4 for email writing), each with a different hidden user preference. This is an attempt to capture the real-world scenario where a user's preference differs across contexts (e.g., writing an email to a friend versus writing a conference review). The document source is hidden from the agent. Further, the documents, i.e., contexts, from different sources are shuffled and presented in a random order to the agent to avoid any temporal correlation.

The CIPHER algorithm (or any algorithm implementing the PRELUDE framework) infers a preference description for each context (document). Each document has a hidden user preference that is determined by its source. We compute a 0-1 accuracy based on whether the inferred preference is closer, in BERTScore, to the true preference of the document's source than to the true preference of any other document source. Note that this score (called accuracy in Table 2) has no learned parameters (see Lines 187-192 for more details).
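A minimal sketch of this 0-1 accuracy, with `bertscore_f1` as a stand-in for a BERTScore-style similarity between two preference strings:

```python
# Sketch of the 0-1 preference accuracy described above. `bertscore_f1` is a
# stand-in for a BERTScore-style similarity between two preference strings.

def preference_accuracy(inferred_pref, source, true_prefs, bertscore_f1):
    """true_prefs: dict mapping each document source d to its true preference.
    Returns 1 if the inferred preference is closer (by similarity) to the true
    preference of its own source than to that of every other source, else 0."""
    own = bertscore_f1(inferred_pref, true_prefs[source])
    others = max(bertscore_f1(inferred_pref, true_prefs[d])
                 for d in true_prefs if d != source)
    return 1 if own > others else 0
```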

value of T at which maximum accuracy was reached...

Plotting accuracy vs $T$: We would be happy to include this plot in the revision.

What does the notation $y_{z_t}$ refer to? Over what corpus is this retrieval performed?....

ICL-edit descriptions, source of retrieval, notation, and tradeoff: We want to clarify that all retrievals in this paper happen only over the agent's past history. We do not use external sources for performing retrieval. This avoids mis-alignment errors arising due to the choice of retrieval corpus, although using a separate retrieval corpus could be a nice extension.

As shown in Protocols 1 and 2, we have $T$ rounds of online interaction, where in round $t$ the agent response is $y_t$. Therefore, $y_{z_l}$ refers to the agent response in round $z_l$. We explain $z_l$ below.

For the ICL-edit baseline, in round $t$ we find the $k$ closest past rounds based on cosine similarity between the current round's context and the previous rounds' contexts. Thus, $z_1, z_2, \ldots, z_k$ are the indices of the previous rounds that we retrieve, and $y_{z_l}$ refers to the agent response in round $z_l$, which is the $l$-th retrieved past round.

...Is it reasonable to think that ICL-edit would have higher performance at lower T...

ICL-edit performance at lower T: We can evaluate ICL-edit performance for any value of $T$ in Figure 2, as explained in the first point above. We can see that the gap between baselines in the beginning is small, although CIPHER-5-M does slightly better than both ICL baselines. However, by the end CIPHER achieves the best performance.

The ICL-edit agent always makes a single LLM call in each round, as we do not use an LLM for retrieval. We instead use cosine similarity to retrieve examples, create a prompt using the retrieved examples and the context, and generate a response using the LLM.

...in future work for only inferring preferences from edits in some sessions and suggesting a prompt re-write in others...

Mixing user edits and prompt rewrites: This is a very interesting question. In our experiments, we did not observe cases of complete re-writes, which could be either because GPT-4 is good at generating a reasonable response, or because we need to work on more complex problems where LLMs struggle more before we see such cases. In real-world deployments, it is possible that there are cases where the entire output needs to be edited, and users can then choose other feedback mechanisms like prompt re-writing or language feedback. We will add a discussion of this direction; learning from such heterogeneous feedback is an important future work direction.

Comment

Thank you for the responses. Please be sure to incorporate your responses into the paper.

Final Decision

The paper investigates interactive learning of language agents by analyzing user edits to an agent's responses, and proposes a framework, PRELUDE, to learn natural language user preferences from edits the user makes to the agent's outputs. An algorithm, CIPHER, is introduced to infer natural language preferences given edits from a (simulated GPT-4) user. The approach is evaluated in two writing tasks, summarization and email writing, and compared to reasonable baselines.

Reviewers agree on the following:

Strength:

  • The experiments are comprehensive.
  • The paper addresses an important problem in the interactive learning of LLM-based agents, and the proposed framework is overall intuitive and reasonable.
  • The paper is well-written and easy to follow, and the proposed algorithm/framework is well-illustrated.

Weakness:

  • Oversimplified setup: the latent preferences consist of 5 for the summarization task and 4 for the email writing task. In realistic personalization scenarios there would be a greater number of personalized preferences.
  • The GPT-4 implemented user simulator should be evaluated to ensure it aligns with real human behavior.

Overall the reviewers have a similar opinion that the framework is novel and experiments are comprehensive, therefore I am leaning towards an accept.