PaperHub
Overall rating: 5.5 / 10 · Poster · 4 reviewers (lowest 4, highest 7, std. dev. 1.1)
Individual ratings: 5, 6, 7, 4
Confidence: 4.0 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

AGILE: A Novel Reinforcement Learning Framework of LLM Agents

Submitted: 2024-05-13 · Updated: 2025-01-03

Keywords: LLM agent, reinforcement learning, LLM-human interaction

Reviews and Discussion

Official Review
Rating: 5

This paper jointly introduces an evaluation environment for LLM agents and an LLM agent architecture called AGILE.

The environment is a question-answering environment concerning the characteristics of commercial products sourced from Amazon. The benchmark itself is composed of 3 tasks: 1) a fact retrieval task where the agent should retrieve information about a product, 2) a product query task where the agent should retrieve a product that matches query criteria, and 3) a reasoning task where the agent should reason from the knowledge it has about a product to answer questions about possible uses or suitability of a product.

The agent is composed of an LLM backbone that, under the control of an executive module, can retrieve memories of previous interactions stored in its database and use tools. Tools are used for agentic flow (reflect, submit answer), for memory management (retrieve memory, put new memory in database), and for asking advice from a human expert (access to ground truth about a product). The LLM is fine-tuned with SFT on these tasks. The agent is further fine-tuned on a scored version of the tasks with reinforcement learning (PPO).
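
A minimal Python sketch of the control loop described above (all class and method names here are illustrative assumptions, not the paper's actual API):

```python
# Illustrative sketch of the reviewed agent loop: an executive layer lets the
# LLM backbone retrieve memories, call tools, seek expert advice, or answer.
# Names and structure are assumptions for exposition only.
from dataclasses import dataclass, field

@dataclass
class Memory:
    entries: list = field(default_factory=list)

    def retrieve(self, query: str, k: int = 3) -> list:
        # Placeholder retrieval: return the k most recent entries.
        return self.entries[-k:]

    def append(self, entry: str) -> None:
        self.entries.append(entry)

def run_session(llm, memory: Memory, question: str, product_db, expert, max_steps: int = 4):
    """One QA session: the executive module lets the LLM pick actions until it answers."""
    context = list(memory.retrieve(question))
    for _ in range(max_steps):
        action = llm.decide(question, context)  # e.g. "SearchProduct", "Reflect", "SeekAdvice", "SubmitAnswer"
        if action == "SearchProduct":
            context.append(product_db.lookup(question))
        elif action == "Reflect":
            context.append(llm.reflect(question, context))
        elif action == "SeekAdvice":
            answer = expert.ground_truth(question)      # costly ground-truth advice
            memory.append(f"Q: {question} -> expert: {answer}")
            return answer
        else:  # SubmitAnswer
            answer = llm.answer(question, context)
            memory.append(f"Q: {question} -> A: {answer}")
            return answer
    return llm.answer(question, context)
```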

The authors demonstrate the superiority of their architecture over one-shot prompting of GPT-4 on their proposed environment and on a medical QA benchmark, and show that reflection, tool use, memory, ground-truth product advice, and reinforcement learning contribute positively, to varying degrees, to agent performance.

Strengths

  • This paper combines known LLM agent modules and studies their impact in two question-answering domains that are of interest to industry practitioners: product recommendation and medical knowledge;
  • Agents asking for help has been studied at length in the embodied question answering literature with previous-generation agents (see for instance https://arxiv.org/abs/1909.01871 or https://arxiv.org/abs/2302.04865), but as far as I am aware has not been studied with LLM-based agents;
  • The paper is fairly clear and easy to read, the proposed contributions are easy to understand, examples are provided for the task, and the results are comprehensive;
  • The claims of the paper are well substantiated with extensive experiments, and the ablations make it easy to disentangle the effect of different components of the agent;
  • The data collection process for creating the benchmark seems sound and the task looks challenging and interesting;
  • Reinforcement-Learning studies of complete LLM agents are rare and needed.
  • The tables in the related work make the relationship between this work and previous art clearer.

Weaknesses

  • An overarching and pervasive weakness of this paper that is difficult to ignore and that has a great influence on my judgement is the claim of novelty of the agent architecture. In a post-Voyager (https://arxiv.org/abs/2305.16291) world, LLM-based agents equipped with tool use and retrieval are standard, and the AGILE agent presented in this work introduces little conceptual novelty in the agent architecture, while even the title of the paper claims that the agent is novel. LLM agents of this type are so much a part of the landscape that many reviews exist that organize existing approaches (see for instance the excellent https://arxiv.org/abs/2309.02427, which is not that recent either). The approach is standard enough that advanced software stacks (langchain, llamaindex) are devoted to speeding up building retrieval-augmented generation (RAG)-based agents for industry practitioners. The paper framing should be updated accordingly to reflect that the approach being studied is standard, but applied to a new task.
  • Relatedly, it would have been important for the authors to exactly pinpoint the difference between their agent and, for instance, Voyager, and either implement baselines from existing work on their task or test their agent on other tasks (e.g. WebShop, which is very similar in spirit).
  • There are still novel elements in the proposed approach. The first is the use of reinforcement learning. While the use of reinforcement learning to improve language agents is not novel (see https://arxiv.org/abs/2302.02662 for a textworld task or https://arxiv.org/abs/2403.04642 for a math word-problem task), the use of RL in RAG agents has not been studied as far as I am aware. It also makes up a significant portion of the method section and the appendix; however, the experiments demonstrate only a marginal improvement due to RL. Why is this the case? What do the training curves look like? Has the training converged? I also do not see any standard deviation information for the RL part. The authors justify the lack of error bars with the computational cost of experiments, which is fair and completely acceptable where the comparison between methods is straightforward, but error bars would have been helpful in the SFT/PPO comparison, where the difference is small and could have been due to chance; it also makes sense to concentrate computational effort on a truly novel part of the contribution;
  • The other novel element of this paper is the asking-for-advice part: the model can opt out of the task by seeing the ground truth, which it can then report, at a cost (for the RL agent). We can see that allowing the agent this action increases its accuracy, that increasing the cost of advice results in RL agents using advice less often, and that memoryless agents use advice more often. This is interesting, and could have been a core contribution of the paper if it were made explicit and investigated as such. The conditions under which LLM agents are aware of their own knowledge, and whether they take appropriate steps to reduce their lack of information, form a research programme in itself, as discussed by John Schulman in his 2023 Berkeley lecture: https://www.youtube.com/watch?v=hhiLw5Q_UFg&t=215s. Can the authors predict when the model is going to ask for advice? Does the RL training help the model know what it knows (metacognitive skills)? These are interesting questions that the authors could have investigated;
  • Relatedly, the related work paragraph on uncertainty looks unfinished. Relevant work on LLMs' metacognitive abilities is cited, but there is no word on how the current work contributes to the discussion;
  • (minor) Training time and the number of GPUs required should appear alongside the GPU type; does the training use LoRA?
  • The evaluation for long-form answers is based on GPT-4, but no evidence is presented on the agreement between GPT-4 and human judges on this task.

Questions

  • My first and most important suggestion is that the authors should change the framing of the paper. The experimental work is sound and the tasks are worthwhile, but in the current state of the paper the conclusions seem to be that RAG agents work better than prompting on question-answering tasks, which is known. However, the paper could be: 1) a study on RL in RAG agent settings, 2) a study on training metacognitive abilities of RAG agents, and/or 3) a paper introducing a new benchmark (but the tasks seem not that challenging compared with tasks like SWE-bench https://www.swebench.com/ or reasoning tasks like theorem proving https://github.com/openai/miniF2F, https://leandojo.org/).
  • [Conclusion, l305] why is the LLM System 1 and interaction with tools System 2? What does it mean? If the claim is to be left in the paper it should be explained and adequate citations should be provided (also, is this analogy fruitful? What purpose does it serve? Is this a claim about the cognitive plausibility of the proposed agent architecture or its resemblance to humans?).
  • [Clarity] The paper should explain what MCQA is when first mentioned, since it is confusing for people not familiar with the benchmark. The same goes for the Meerkat model.
  • Why is the improvement from RL only 2%?
  • The original RAG paper should be cited: https://arxiv.org/abs/2005.11401

Limitations

  • Some limitations with respect to lack of resources are acknowledged in the appendix (small open source models are trained, only a subset of the training set is used). This is fine; the methods used are pretty powerful and can be expected to scale with increasing compute.
  • If understood as a way to help agents answer queries, expert advice is definitely non-scalable. It is fine as a research question, maybe less so for modelling real-world applications. This could be mitigated with long-term memory of the agent (give advice once, retrieve forever). In any case, a discussion on the feasibility of expert advice in real world scenarios would be nice.
Comment

We sincerely thank you for your effort and valuable comments in reviewing our paper. We appreciate your recognition of our contributions and your insightful feedback. We have addressed these concerns in the Rebuttal Section within the given time constraints to the best of our ability.


Please let us know if our responses address your concerns. We are grateful for your time and effort in helping us enhance our work.

Author Response

Q1 (Weakness 1 & Question 1): Paper framing

A1: Regarding novelty, we would like to note that we have summarized our core contributions at the end of the Introduction section [line 65], aligning exactly with your suggestions. The Methods section also aligns with your summary. We acknowledge that the "novel framework" statement was somewhat overstated. Hence, we will revise our paper's title to: "AGILE: A Novel Reinforcement Learning Framework for LLM Agents," to emphasize that our main contribution is the end-to-end RL formulation and training of LLM agents. We will also properly update other parts of the paper to clearly convey this point.

Beyond the RL aspects, we want to highlight other novel contributions of our paper. First, training the agent to proactively seek advice requires it to silently estimate its own confidence before making decisions. This capability cannot be achieved through SFT alone, further demonstrating the importance of RL training.

Second, the ProductQA dataset is a novel contribution. It is the first agent task involving very long trajectories, with each containing thousands of correlated QA sessions and millions of tokens in total. The actions the agent takes in one session influence all future sessions. This characteristic is markedly different from other benchmarks where the impact of actions is much shorter-term.


Q2 (Weakness 2): Difference between AGILE and other agents

A2: Agents like Voyager are based on prompt engineering, while AGILE is an RL-formulated framework that allows end-to-end training. Application of AGILE to other tasks, such as WebShop, is left to future work.


Q3 (Weakness 3 & Question 4): Additional results of RL training

A3: First, we provide training curves in the global response PDF file, which indicates that RL training converged after about 500 steps.

Second, following your suggestion, we conducted multiple independent trials of PPO training to study the variation of the result (see Table 2 in the PDF file). On average, RL training improves the total score by 2.6%, with a standard deviation of 0.3%, demonstrating the significance of RL improvements.

Third, we would like to discuss why RL training improved performance by a moderate 2%. We believe the reasons are: (1) The agile-vic13b-sft model is a strong baseline as it imitates the policy of the agile-gpt4-prompt; (2) SeekAdvice decisions made by agile-gpt4-prompt are nearly optimal under the default advice cost (c=0.3). To further study the impact of RL training, we conducted additional experiments in two more settings:

  • We re-generated the SFT training data for agile-vic13b-sft such that the agent performs SeekAdvice randomly in 25% of cases. This initial policy is simpler but more general. In this setting, we name the SFT model agile-vic13b-sft-random, and the final model trained with RL on top of it agile-vic13b-ppo-random. As shown in the table in the global response, RL training brings a 7.1% improvement in this setting. Interestingly, the performance of agile-vic13b-ppo-random is better than that of the agile-vic13b-ppo reported in the paper. We conjecture that random SeekAdvice is a better initial policy because it enables exploration in all directions.

  • In the second experiment, we lowered the advice cost to 0.1. After PPO training, the agile-vic13b-ppo-random agent quickly adapted to the new cost, performing SeekAdvice much more aggressively than the initial agent trained by SFT. In this scenario, RL training brings a 22.3% improvement.

See Table 3 in the PDF file for concrete numbers.


Q4 (Weakness 4): RL training for metacognitive skills

A4: RL training helps the model's metacognitive skills. Experiments on ProductQA and MedMCQA datasets show that, after PPO training, the agent seeks advice less often and achieves higher accuracy, indicating improved self-assessment in resolving queries versus seeking assistance.

Q5 (Weakness 5): Related work and our contributions to the topic of uncertainty

A5: We note that AGILE solves a complex self-evaluation problem because the agent must make multiple seeking-advice decisions in a long trajectory, each having a long-term influence on future decisions. This differs from existing work on uncertainty. We will add a discussion to the revised version.


Q6 (Weakness 6): Training details

A6: On ProductQA, SFT takes 3.6 hours and PPO takes 5.5 hours on 8 H800 GPUs. On MedMCQA, SFT takes 0.9 hours and PPO takes 2 hours. The LLM is trained without using LoRA.


Q7 (Weakness 7): Agreement between GPT-4 evaluator and human

A7: We conducted an additional evaluation by randomly selecting 100 triplets (question, reference long answer, model-predicted long answer) from ProductQA and manually labeled the correctness. Our results show a 94% agreement rate between the GPT-4 evaluator and the author.
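
As a rough back-of-the-envelope check (not a calculation from the paper), a 95% Wilson interval for 94 agreements out of 100 samples spans roughly 0.87 to 0.97:

```python
# Illustrative uncertainty estimate for the reported 94/100 agreement rate
# between the GPT-4 evaluator and a human labeler (not from the paper).
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(94, 100))   # approximately (0.875, 0.972)
```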


Q8 (Question 2): Analogy to System 1 and System 2 processes

A8: According to the dual-process theory [1], human thinking involves two systems: System 1 (fast and unconscious) and System 2 (slow and conscious). Recent research has explored AI systems incorporating both processes [2]. We propose that AGILE uses LLM for System 1 and integrates external tools for System 2, mirroring human thinking. We will include further discussion in the revised version.

[1] D. Kahneman. 2003. Maps of bounded rationality: Psychology for behavioral economics.

[2] Y. Bengio, Y. Lecun, and G. Hinton. 2021. Deep learning for AI.


Q9 (Question 3): Further explanations of MCQA and the Meerkat model

A9: MedMCQA is a multiple-choice QA dataset from medical school entrance exams. Meerkat is a medical LLM trained with high-quality CoT reasoning paths from 18 medical textbooks and diverse instruction-following datasets. We will include these explanations in the revised version.


Q10 (Question 5): Citation of RAG paper

A10: We will add it to the paper.

Comment

I thank you for your reply, which answers some of my questions. I am especially happy that you took the time to perform several seeds of the RL experiments (the results are much more robust now) and that you performed comparisons with human judgements in the answer to Q7 (I now trust the results of the evaluations).

I also appreciate that you performed additional RL experiments to try and show it helps compared to the strong SFT baseline. To make sure I really understand: the original SFT policy had 0% [SeekAdvice] rates? And what your new experiment does is start out with higher advice rates and show that this leads to more efficient use of human advice (especially when the cost of advice falls to 0.1, effectively solving the task by soliciting advice more than half of the time)?

As to your remark:

A2: Agents like Voyager are based on prompt engineering

I think nothing prevents researchers from building a Voyager agent (with an open-weights model) and performing RL on it, right?

I also appreciate that you emphasize that you build a new task, but from what I can see it doesn't seem very hard since agents (with quite standard architecture, again I think the type of agent you implement is quite common, see also https://arxiv.org/abs/2304.03442) based on 13b models, without RL or search, are able to solve most of it. That's why I was calling for evaluations on other datasets, to be able to compare. (ReAct is not a very strong baseline). I appreciate your new experiments on HotpotQA, and would be interested to know how it compares to the SotA listed in the benchmark homepage (in terms of metrics), and how you implemented advice in the case of this dataset.

Comment

Thank you for your valuable feedback. We would like to address each of your questions as follows.


Q1: I also appreciate that you performed additional RL experiments to try and show it helps compared to the strong SFT baseline. To make sure I really understand: the original SFT policy had 0% [SeekAdvice] rates? And what your new experiment does is start out with higher advice rates and show that this leads to more efficient use of human advice (especially when the cost of advice falls to 0.1, effectively solving the task by soliciting advice more than half of the time)?

A1: In all training experiments, we consider three types of SeekAdvice rates: 1) the rate in the SFT training data, 2) the rate predicted by the SFT agent after training, and 3) the rate predicted by the RL agent after training.

For both the original and additional experiments, the SeekAdvice rate in the SFT training data (rate 1) is 25%. In the original experiment, GPT-4 makes SeekAdvice decisions. In the additional experiment, these decisions are made randomly but maintain the 25% rate.

In the original experiment, the SFT agent (agile-vic13b-sft) predicts SeekAdvice in 25.6% of cases, closely matching the training data. In contrast, in the additional experiment, the SFT agent (agile-vic13b-sft-random) predicts SeekAdvice in only 1.4% of cases. This discrepancy is due to the LLM's greedy decoding. Although it predicts SeekAdvice with logit scores corresponding to a 25% probability, the greedy decoder typically selects other actions with higher logit scores. We also tried random sampling decoding, and the results are included in the table below.
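
A toy illustration (not the paper's code) of this decoding effect: an action whose probability is 25% at every step is essentially never chosen under greedy decoding when another action is always more likely, while sampling recovers roughly the 25% rate.

```python
# Toy illustration of greedy vs. sampling decoding over a fixed action
# distribution; in the real agent the logits of course vary per question.
import numpy as np

rng = np.random.default_rng(0)
actions = ["PredictAnswer", "SeekAdvice"]
probs = np.array([0.75, 0.25])          # assumed fixed per-step action distribution

greedy = [actions[int(np.argmax(probs))] for _ in range(10_000)]
sampled = [actions[rng.choice(len(actions), p=probs)] for _ in range(10_000)]

print("greedy  SeekAdvice rate:", greedy.count("SeekAdvice") / len(greedy))    # 0.0
print("sampled SeekAdvice rate:", sampled.count("SeekAdvice") / len(sampled))  # ~0.25
```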

In the original experiment, the RL agent (agile-vic13b-ppo) has a SeekAdvice rate of 23.3%, while in the additional experiment, the RL agent (agile-vic13b-ppo-random) has a rate of 30.6%. The new agent achieves a higher overall score, likely because it starts from a simpler and more general initial policy, allowing PPO to find a better final policy. In contrast, the original agent starts with a strong GPT-4 policy and only makes slight adjustments.

When the SeekAdvice cost is reduced to 0.1, PPO training converges to a much more aggressive SeekAdvice rate of 67.1%. This experiment demonstrates that RL training is sensitive to the SeekAdvice cost, always optimizing the policy under specific costs. In contrast, the GPT-4 agent remains insensitive to cost changes, even when the cost is provided as input. The SFT agent is also insensitive since generating optimal SFT data for a given cost is difficult.

| Model | Seeking Advice Cost | [SeekAdvice] Rate in SFT Training Data | Advice Rate (Greedy) | Accuracy (Greedy) | Total Score (Greedy) | Advice Rate (Sampling) | Accuracy (Sampling) | Total Score (Sampling) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| agile-vic13b-sft | 0.3 | 0.25 | 0.256 | 0.843 | 0.766 | 0.308 | 0.839 | 0.747 |
| agile-vic13b-ppo | 0.3 | - | 0.233 | 0.854 | 0.784 | 0.278 | 0.842 | 0.759 |
| agile-vic13b-sft-random | 0.3 | 0.25 | 0.014 | 0.749 | 0.745 | 0.291 | 0.823 | 0.736 |
| agile-vic13b-ppo-random | 0.3 | - | 0.306 | 0.890 | 0.798 | 0.363 | 0.896 | 0.787 |
| agile-vic13b-sft-random | 0.1 | 0.25 | 0.014 | 0.749 | 0.748 | 0.291 | 0.823 | 0.794 |
| agile-vic13b-ppo-random | 0.1 | - | 0.671 | 0.981 | 0.914 | 0.573 | 0.941 | 0.884 |

Q2: I think nothing prevents researchers from building a Voyager agent (with an open-weights model) and performing RL on it, right?

A2: We agree with you that our main contribution is the end-to-end RL formulation and training of LLM agents.

In addition, our agent can manage very long trajectories. In the ProductQA task, there can be hundreds of question-answering rounds, generating very long trajectories that yield training sequences spanning millions of tokens. To address this challenge, two key capabilities are necessary: 1) some actions can erase or modify the context to prevent it from growing indefinitely and becoming unmanageable for the LLM; 2) session-level RL training. These challenges cannot be resolved simply by applying PPO training to Voyager-like agents. We discuss the solutions to these issues in our work.

Comment

Q3: I also appreciate that you emphasize that you build a new task, but from what I can see it doesn't seem very hard since agents (with quite standard architecture, again I think the type of agent you implement is quite common, see also https://arxiv.org/abs/2304.03442) based on 13b models, without RL or search, are able to solve most of it. That's why I was calling for evaluations on other datasets, to be able to compare. (ReAct is not a very strong baseline). I appreciate your new experiments on HotpotQA, and would be interested to know how it compares to the SotA listed in the benchmark homepage (in terms of metrics), and how you implemented advice in the case of this dataset.

A3: Thanks for your comments.

  1. Experiment implementation.

    In the HotPotQA task, the agent has the option to either use search tools, seek advice, or directly predict an answer. If the agent chooses to use search tools, it generates a search query to retrieve relevant information, which is then appended to the agent's context. If the agent seeks advice, it obtains a human answer (ground-truth answer in our setting).

  2. Comparison to SoTA listed in the benchmark homepage.

    Thank you for your reminder. We will include the SoTA results in our results table. However, we want to highlight two key differences between our AGILE agent and the methods on the leaderboard:

    • Our agent uses an external retrieval tool that returns the most relevant document based on a query. This retriever is the same as the one we used for ProductQA and MedicalQA and is not fine-tuned on the HotpotQA dataset. During training, we focus solely on training the LLM of the agent. In contrast, the top five systems on the leaderboard [1,2,3] (except the second one, which lacks a reference link) all train task-specific retrieval models on the 90K HotpotQA training examples. Training such models significantly boosts accuracy, and these works claim it as their main technical contribution. Our paper primarily studies the training of LLMs and treats external tools as black boxes, so using a task-independent retriever aligns better with our objectives.

    • The top systems on the leaderboard [1,2,3] train separate answer extraction models to extract answers from document spans. Our agent directly generates answers from the LLM. While the generative nature may lower exact-match (EM) accuracy, we believe it enhances simplicity and generalizability. Besides EM accuracy, we also calculated a GPT-4 accuracy where answers are compared for correctness, recognizing, for example, (USA, United States) as a correct match. As shown in the table below, our system's actual accuracy is much higher than its EM accuracy.

    Regarding baselines, we noted the original ReAct baseline's implementation in [4] was suboptimal. We reproduced their results using GPT-4, leading to improved performance. In the table below, we also provide results of other works that use the same experimental setting as ours (using a black-box retrieval tool and generative answers). Our system's overall performance surpasses all these baselines.

| Method | Advice Rate | Accuracy (Exact match) | Accuracy (GPT-4 Evaluator) | Total Score (Exact match) |
| --- | --- | --- | --- | --- |
| ReAct [4] | - | 0.351 | - | - |
| ReAct (gpt-4) | - | 0.482 | - | - |
| CRITIC [5] | - | 0.443 | - | - |
| Expel [6] | - | 0.390 | - | - |
| AutoAct [7] | - | 0.384 | - | - |
| agile-gpt4-prompt | 0.194 | 0.664 | 0.842 | 0.567 |
| agile-vic13b-w/o Advice | 0.000 | 0.553 | 0.751 | 0.553 |
| agile-vic13b-w/o RL | 0.171 | 0.668 | 0.857 | 0.617 |
| agile-vic13b-ppo (ours) | 0.156 | 0.675 | 0.858 | 0.628 |
| Supervised SoTA [1] | - | 0.727 | - | - |

[1] J. Zhang, et al. End-to-End Beam Retrieval for Multi-Hop Question Answering, NAACL 2024.

[2] Z. Yin, et al. Rethinking label smoothing on multi-hop question answering, CCL 2023.

[3] XY. Li, et al. From easy to hard: Two-stage selector and reader for multi-hop question answering, ICASSP 2023.

[4] S. Yao, et al. ReAct: Synergizing reasoning and acting in language models, ICLR 2023.

[5] Z. Gou, et al. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, ICLR 2024.

[6] A. Zhao, et al. Expel: Llm agents are experiential learners, AAAI 2024.

[7] S. Qiao, et al. AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning, ACL 2024.

Comment

Dear Reviewer b4yX,

Thank you for taking the time to review our manuscript. We sincerely appreciate your insightful feedback. As the deadline for the discussion phase nears, we would like to kindly remind you of our recent response, in which we diligently addressed the concerns you raised and provided detailed explanations.

Should you have any further questions or require additional clarifications, we would be eager to address them promptly.

Thanks again for your valuable feedback.

Comment

Thank you for your thorough response!

In light of the new experiments and additional data, I find that some of my concerns have been addressed.

This paper:

  • introduces a long-horizon task that has real-world relevance, in which one can ask for expert advice;
  • introduces a RL formulation (and training algorithm) for LLM-agents;
  • shows that traditional LLM agent components help solve the task (known);
  • shows that RL training helps solve the task, and, importantly, that RL training allows the agent to ask for advice when it is not able to solve the task, which leads to high scores when the cost of advice is low;
  • implements and tests the agent on HotPotQA where it achieves scores competitive with specialized baselines.

I am still concerned that RL training seems not so effective and that large-scale, cheap advice is an unrealistic scenario, but the agent learning to defer when it doesn't know the answer, as well as the experiments on HotpotQA, have convinced me to raise my score. I would also kindly request that the formulation of the paper be amended to reflect the exact components of the method that are novel.

Comment

We really appreciate your thoughtful feedback and recognition of our work's contributions. We are grateful for the improved score, which is truly encouraging. Your constructive feedback helped us enhance our work, and we will certainly integrate your suggestions into our revised version.

In response to your concerns about the effectiveness of RL training, we would like to highlight two key advantages:

  1. RL training enables the discovery of a better policy compared to the one obtained from SFT training alone.
  2. RL training is particularly sensitive to the SeekAdvice cost, optimizing the policy according to specific costs, whereas SFT training is insensitive since generating optimal SFT data for different costs is difficult.

Thank you again for your effort and valuable feedback, which helped us improve our paper.

Official Review
Rating: 6

This work presents a novel framework for LLM agents named AGILE. The entire AGILE system is trained end-to-end using reinforcement learning. A key feature of AGILE is its ability to seek advice from external human experts. Additionally, the authors have developed ProductQA, a challenging dataset of complex QA, to comprehensively evaluate the capabilities of the agent. Extensive experiments show that an agent within this framework, even when based on a smaller model and trained with RL, can outperform GPT-4.

Strengths

  1. The paper introduces a novel framework, AGILE, for LLM agents that integrates memory, tools, and expert consultations, all optimized through reinforcement learning.

  2. The development of the ProductQA dataset.

  3. The paper is well-organized and clearly written. The architecture and workflow of the AGILE framework are explained with clarity.

  4. The experiments on ProductQA and MedMCQA verify the AGILE framework and show that AGILE agents based on 13B and 7B LLMs trained with PPO can surpass GPT-4 agents.

Weaknesses

  1. It is somewhat like WebGPT (in terms of actions and policy learning strategy), with extensions supporting memory, expert consultation, and reflection.

  2. The experiments are all related to a single agent. It would be better to show the application for multiple agents and include planning.

Questions

  1. How can this framework be extended to support multiple agents?

  2. In Table 5, the accuracy of "w/o Memory" and "w/o Tool-Use" is higher than that of agile-vic13b-ppo. Is the metric "Accuracy – Advice Rate" a more straightforward one than Total Score?

Limitations

The authors have addressed the limitations.

Author Response

We sincerely thank you for your positive feedback and effort in reviewing our paper. Thank you for your constructive questions. In our response, we will quote each question and provide our answers accordingly.

Response to comments


Q1 (Weakness 1): It is somewhat like WebGPT (in terms of actions and policy learning strategy), with extensions supporting memory, expert consultation, and reflection.

Answer: Thanks for bringing this to our attention. We would like to clarify the key distinctions between our proposed AGILE framework and WebGPT:

  • WebGPT primarily uses a policy learning strategy to train the agent for web operations and does not incorporate memory, expert consultation, or reflection. In contrast, AGILE emphasizes proactively seeking advice from human experts, allowing the agent to achieve high-level accuracy when handling complex and challenging questions. The agent can also enhance its performance over time by reflecting on expert feedback and memorizing it for future use.

  • While WebGPT utilizes reinforcement learning, our work addresses the significant challenge of training the policy for invoking various modules and the reasoning, planning, reflection, and seeking advice abilities of the LLM agent in an end-to-end manner. This challenge is particularly pronounced in long trajectory scenarios. To overcome this, we propose a session-level optimization algorithm that facilitates policy learning at the session level, thereby mitigating the difficulties associated with long trajectory optimization.

  • As shown in Table 5 in our paper, incorporating memory, seeking advice, and reflection into AGILE results in relative total score improvements of 4.0%, 5.0%, and 1.7%, respectively, on the ProductQA dataset. These results demonstrate the necessity of these modules for LLM-based agents in practical scenarios.


Q2 (Weakness 2): The experiments are all related to a single agent. It would be better to show the application for multiple agents and include planning.

Answer: We appreciate your valuable suggestions. Applying AGILE to multi-agent systems is indeed an interesting direction. The AGILE framework can be extended to facilitate interactions with machine agents in various roles, such as students or teachers, and in different formats, such as debates or coordination. For example, a multi-agent version of AGILE could seek advice from a human expert at a high cost or from a more capable machine agent at a lower cost. Furthermore, the planning for multiple agents can be enhanced using end-to-end RL training.

In our paper, we primarily focus on the RL formulation and end-to-end training of LLM agents, along with their seeking advice capabilities, which we believe are fundamental and essential for agent performance. Due to space and time limitations, the application of AGILE to multi-agent systems will be addressed in future work. Thank you for your constructive feedback.


Q3 (Question 1): How can this framework be extended to support multiple agents?

Answer: Thanks for your question. Please refer to our response to Q2.


Q4 (Question 2): In Table 5, the accuracy of "w/o Memory" and "w/o Tool-Use" is higher than that of agile-vic13b-ppo. Is the metric "Accuracy – Advice Rate" a more straightforward one than Total Score?

Answer: Thank you for this valuable comment. In Table 5, although the accuracy of "w/o Memory" and "w/o Tool-Use" is higher than that of agile-vic13b-ppo, they exhibit a significantly higher Advice Rate compared to agile-vic13b-ppo. This higher Advice Rate results in lower overall performance for "w/o Memory" and "w/o Tool-Use" when assessed using our Total Score metric.

In our paper, the total score is defined as the average reward across all sessions, taking both the advice rate and the accuracy into account. Specifically, the total score is expressed as "Total Score = Accuracy - Seeking Advice Cost * Advice Rate". This formulation encompasses the "Accuracy – Advice Rate" metric as a special case where the Seeking Advice Cost is set to 1.
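
A worked check of this formula, using the greedy-decoding numbers for agile-vic13b-ppo quoted in the table earlier on this page:

```python
# Worked example of the Total Score definition stated above.
def total_score(accuracy: float, advice_rate: float, advice_cost: float) -> float:
    return accuracy - advice_cost * advice_rate

# agile-vic13b-ppo, greedy decoding, advice cost c = 0.3
print(round(total_score(0.854, 0.233, 0.3), 3))   # 0.784, matching the reported Total Score

# With advice cost c = 1, the metric reduces to "Accuracy - Advice Rate"
print(round(total_score(0.854, 0.233, 1.0), 3))   # 0.621
```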

It's important to note that the Seeking Advice Cost can vary across different real-world applications. For instance:

  • In a math problem-solving task, the required human expert might need high professional expertise in mathematics, resulting in a high cost (e.g., a cost of 1).
  • In product question-answering scenarios, the human expert could be a normal person who has received short-term customer service training, leading to a lower cost (e.g., a cost of 0.3).

Please let us know if our replies address your concerns. Thanks for taking the time to consider this discussion. We appreciate your time and effort in helping us improve our work.

Comment

I thank the authors for the detailed response.

Official Review
Rating: 7

This paper proposes AGILE, a reinforcement-learning based framework for finetuning LLMs for conversational QA tasks. The models are initially trained using imitation learning, then further finetuned with RL. Once finetuned, the models show strong performance, surpassing GPT-4 while using much smaller models.

Strengths

  • The proposed framework is interesting and novel, distilling the usage of various tools, reflection, memory retrieval/writing, and human advice seeking from larger models' trajectories, to a smaller model, then further training it using RL to surpass the performance of the larger model
  • The performance is strong
  • Introduces a new dataset, ProductQA which is a useful resource for conversational QA in online shopping scenarios.
  • Cost-benefit analysis of advice seeking is insightful.

Weaknesses

I only have two main concerns:

  • Because the model is only evaluated on two conversational QA datasets (and not on any of the mentioned datasets in Table 8), it's difficult to judge the generality of the proposed method.
  • There is no comparison with standard PPO.

Questions

What is the reason why there is no comparison with an RL fine-tuning baseline (standard PPO)?

Limitations

yes

Author Response

We genuinely appreciate your positive feedback and the time you invested in reviewing our paper. Thank you for your insightful questions. In our response, we will address each question individually, quoting them and providing our answers accordingly.

Response to comments


Q1 (Weakness 1): Because the model is only evaluated on two conversational QA datasets (and not on any of the mentioned datasets in Table 8), it's difficult to judge the generality of the proposed method.

A1: Thanks for the helpful suggestion. To address the concern regarding the generality of AGILE, we conducted additional experiments on the HotPotQA dataset, which is one of the tasks listed in Table 8. The main results and ablation study are presented below.

Compared with ReAct implemented by prompting GPT-4, our method (agile-vic13b-ppo) shows a 19.3% relative improvement in accuracy. Furthermore, agile-vic13b-ppo achieves a 10.7% relative improvement in the total score over agile-gpt4-prompt, which is the AGILE agent implemented by prompting GPT-4. The ablation study underscores the indispensability of seeking advice and PPO training in achieving the agent's strong performance.

| Method | Advice Rate | Accuracy | Total Score |
| --- | --- | --- | --- |
| ReAct (gpt-4) | - | 0.482 | - |
| agile-gpt4-prompt | 0.194 | 0.664 | 0.567 |
| agile-vic13b-w/o Advice | 0.000 | 0.553 | 0.553 |
| agile-vic13b-w/o RL | 0.171 | 0.668 | 0.617 |
| agile-vic13b-ppo (ours) | 0.156 | 0.675 | 0.628 |

Q2 (Weakness 2 & Question 1): What is the reason behind why there is not comparison with rl finetuning baseline (standard ppo)?

A2: Thanks for your question. In the ProductQA task, there can be hundreds of question-answering rounds, and these rounds are correlated: actions in earlier rounds can write to memory and thus have lasting effects on subsequent rounds. For example, knowledge distilled from seeking advice can help answer future questions. These long trajectories yield training sequences that span millions of tokens and are impractical for standard PPO training. To address this issue, we proposed a session-level training algorithm that takes these lasting effects into account, which is detailed in Appendix A of our paper.


Please let us know if our replies address your concerns. Thanks for taking the time to consider this discussion. We appreciate your time and effort in helping us improve our work.

Comment

Thank you to the authors for the additional results. This response addressed my concerns.

Official Review
Rating: 4

The paper "AGILE: A Novel Framework of LLM Agents" introduces a new framework for Large Language Model (LLM) agents designed to handle complex conversational tasks. The framework, named AGILE (AGent that Interacts and Learns from Environments), incorporates LLMs, memory, tools, and interactions with experts. AGILE is formulated as a reinforcement learning (RL) problem and is fine-tuned using Proximal Policy Optimization (PPO). The authors present a new dataset, ProductQA, for evaluating the framework and report that AGILE agents outperform GPT-4 in their experiments. The paper claims significant improvements in performance due to the integration of memory, tools, expert consultation, and RL training.

Strengths

Comprehensive Evaluation: The creation of the ProductQA dataset and the extensive experiments conducted on both ProductQA and MedMCQA provide a robust evaluation of the framework’s capabilities.

Significant Performance Improvements: The reported improvements over GPT-4 agents in both ProductQA and MedMCQA are noteworthy, indicating the effectiveness of the AGILE framework.

Detailed Methodology: The paper provides a thorough explanation of the RL formulation, training processes, and the roles of different components within the framework, which enhances reproducibility.

Weaknesses

Limited Novelty: The reliance on human experts for advice, while useful, is not a novel concept; it has been explored in previous works and has become common practice. The paper does not sufficiently differentiate its approach to human expert interaction from existing methods.

Scalability Concerns: The scalability of AGILE to more complex environments is not addressed. It is unclear how well the framework would perform as task complexity increases.

Questions

Benchmark Selection: Why were ProductQA and MedMCQA specifically chosen as the benchmarks for this study? Are there other benchmarks where AGILE could be tested to validate its generalizability?

Human Expert Advice: Is there a ground truth for when the agent needs to seek advice, and how is this optimized?

Memory Component: Can you provide more details on how the memory component scales with an increasing number of interactions and how it ensures efficient retrieval of relevant information?

Limitations

See weakness

Comment

We sincerely appreciate your effort and valuable comments in reviewing our paper. Your recognition of our contributions and your insightful feedback are greatly valued. We have addressed your concerns in the Rebuttal Section to the best of our ability within the given time constraints.


Please let us know if our responses satisfactorily address your concerns. We are grateful for your time and effort in helping us improve our work.

Author Response

Q1: Limited Novelty: The reliance on human experts for advice, while useful, is not a novel concept and has been explored in previous works and becomes a common practice. The paper does not sufficiently differentiate its approach to human expert interaction from existing methods.

A1: Thanks for bringing this to our attention. In the AGILE framework, the agent can proactively seek advice from human experts when its confidence is low. This is distinct from existing human-in-the-loop methods, which rely on passively receiving human feedback (Xiao and Wang, 2023) or following pre-defined rules (https://python.langchain.com/v0.1/docs/use_cases/tool_use/human_in_the_loop/). Proactively seeking advice is a much more complicated decision-making problem, since the agent must estimate its own confidence in the current state, predict the potential value of the advice for future sessions, and consider the cost of experts. We will cite the works you mentioned and clearly differentiate our proactive seeking-advice mechanism from existing human expert interaction methods in our revised paper.


Q2: Scalability Concerns: The scalability of AGILE with more complex environments is not addressed. It is unclear how well the framework would perform with task complexity.

A2: Thanks for your feedback. We acknowledge the importance of addressing the scalability of AGILE. To this end, we have introduced ProductQA, a complex benchmark designed to evaluate the comprehensive capabilities of the agent. ProductQA tests an agent's ability to handle historical information and accumulated knowledge, leverage tools, interact with humans, perform self-evaluation, and conduct reflection. In addition, the training and testing tasks are made disjoint to assess the agent's ability to adapt to new tasks.

Our experimental results show that AGILE agent, based on 13b LLMs and trained with PPO, outperforms GPT-4 agent on ProductQA. This demonstrates the AGILE framework's potential to scale and manage complex tasks effectively. Furthermore, AGILE is a general agent framework, allowing for various extensions, such as integrating additional tools, further enhancing its scalability with increasing task complexity.


Q3: Benchmark Selection: Why were ProductQA and MedMCQA specifically chosen as the benchmarks for this study? Are there other benchmarks where AGILE could be tested to validate its generalizability?

A3: While the proposed agent framework is general, in this paper we evaluate it on complex question answering, a task where an LLM agent has the potential to outperform existing solutions such as using an LLM alone. We use ProductQA since it assesses a variety of agent capabilities, including knowledge accumulation, tool use, human interaction, and reflection, making it ideal for demonstrating the comprehensive ability of our framework. To validate the generality of AGILE, we select MedMCQA as an additional benchmark. This task requires extensive medical knowledge and the ability to effectively seek expert advice.

In response to your suggestion about exploring more diverse benchmarks, we performed experiments on HotPotQA, which features natural, multi-hop questions. We trained an AGILE agent using Vicuna-13b as the base model. The experimental results show that the AGILE agent outperforms the GPT-4 agent by 10.8% in relative total score. Ablation studies verify that PPO training improves the total score by 1.8%. Detailed results of the additional experiments will be included in the revised version to underscore the generality of the AGILE framework.

| Method | Advice Rate | Accuracy | Total Score |
| --- | --- | --- | --- |
| ReAct (gpt-4) | - | 0.482 | - |
| agile-gpt4-prompt | 0.194 | 0.664 | 0.567 |
| agile-vic13b-w/o Advice | 0.000 | 0.553 | 0.553 |
| agile-vic13b-w/o RL | 0.171 | 0.668 | 0.617 |
| agile-vic13b-ppo (ours) | 0.156 | 0.675 | 0.628 |

Q4: Human Expert Advice: Is there a ground truth for when the agent needs to seek advice, and how is this optimized?

A4: Thanks for your questions. Determining when the agent should seek advice relates to the model's confidence and the cost of human experts. Since the optimal decision is model-dependent, it is difficult to establish a ground truth and perform supervised fine-tuning. However, our RL framework can effectively address this issue. By defining rewards for both correctly predicting answers and seeking advice, we can optimize this skill as part of the policy model through end-to-end RL training. For example, we assign a reward of +1 for a correct prediction and a penalty of -c for seeking advice, where c represents the human cost. This approach allows the agent to learn an optimal trade-off between relying on its own predictions and seeking external advice. We will explain the method more clearly in the revised paper.
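
Under this reward scheme, the single-session trade-off the policy must learn can be sketched as a simple threshold rule (an illustration only; it ignores the long-term value of memorizing advice for future sessions, which the RL training additionally accounts for):

```python
# Myopic view of the seek-advice trade-off implied by the reward above:
# answering alone yields expected reward p (the agent's probability of being
# correct), while seeking advice yields 1 - c (a correct answer minus the cost).
def should_seek_advice(p_correct: float, advice_cost: float) -> bool:
    return p_correct < 1.0 - advice_cost

print(should_seek_advice(0.6, 0.3))   # True:  0.6 < 0.7, advice is worth the cost
print(should_seek_advice(0.8, 0.3))   # False: answering is better in expectation
print(should_seek_advice(0.8, 0.1))   # True:  cheaper advice raises the threshold
```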


Q5: Memory Component: Can you provide more details on how the memory component scales with an increasing number of interactions and how it ensures efficient retrieval of relevant information?

A5: Thank you for your question. The memory is designed as a scalable database that stores the agent's trajectories, including question-answering pairs and agent reflections. We employ vector-based retrieval for efficient information access. Specifically, we use the all-MiniLM-L6-v2 model for embedding the text data. When an instance is stored, it is embedded into a vector representation that serves as its key in memory.

For retrieval, we embed the user question using the same model and then perform a cosine similarity search among the stored embeddings. This is a well-studied process and can be performed efficiently even when the memory is large.
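
A minimal sketch of this retrieval scheme, assuming the sentence-transformers implementation of all-MiniLM-L6-v2 (the class and method names are illustrative, not the paper's code):

```python
# Minimal sketch of vector-based memory with cosine-similarity retrieval.
import numpy as np
from sentence_transformers import SentenceTransformer

class VectorMemory:
    def __init__(self):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.keys = []     # unit-normalized embedding vectors
        self.values = []   # stored trajectory snippets (QA pairs, reflections)

    def add(self, text: str) -> None:
        vec = self.encoder.encode(text)
        self.keys.append(vec / np.linalg.norm(vec))
        self.values.append(text)

    def retrieve(self, query: str, k: int = 3) -> list:
        q = self.encoder.encode(query)
        q = q / np.linalg.norm(q)
        sims = np.array(self.keys) @ q            # cosine similarity of unit vectors
        top = np.argsort(-sims)[:k]
        return [self.values[i] for i in top]

mem = VectorMemory()
mem.add("Q: Is this headset compatible with PS5? A: Yes, via the USB dongle.")
mem.add("Reflection: battery questions usually need the spec sheet.")
print(mem.retrieve("Does the headset work with PlayStation 5?", k=1))
```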

Comment

Dear Reviewer 5wwv,

We sincerely appreciate the time and effort you have invested in reviewing our manuscript. As the deadline of the discussion phase approaches, we would like to kindly remind you of our rebuttal, which addresses each concern you raised in your feedback.

If you have any further questions or concerns, we would be happy and eager to address them promptly in the discussion period.

Thank you once again for your valuable feedback.

Author Response

We sincerely appreciate the valuable feedback from all reviewers. We have tried our best to address each question raised in the respective reviews.

Additionally, we have conducted supplementary experiments to address certain concerns and incorporated the results in a one-page PDF file attached to this global response. The PDF includes the following:

  • Results on HotPotQA. Experiment results and ablation study on a new task: HotPotQA.
  • Robustness of RL training. Results of multiple PPO training runs, providing the mean and standard deviation of the improvement brought by RL training.
  • Additional results of RL training. Improvement achieved by our RL training with different initial policy models or different reward values.
  • RL training curves. Reward and value function loss curves during the PPO training process.
Comment

Dear reviewers,

The discussion period will end soon. If you haven't responded to the authors' rebuttal, please do so and kick off the discussion.

Best, SACa

Final Decision

The paper presents a framework for Large Language Model (LLM) agents designed to tackle complex conversational tasks. The AGILE framework integrates reinforcement learning (RL) with memory, tools, and human-expert interactions. It also introduces a new dataset, ProductQA, to evaluate the framework's capabilities. While the reviewers acknowledge the strengths of the paper, they also raise several concerns about the paper’s novelty and scalability.

One of the primary concerns is the limited novelty of the proposed approach, especially in the context of human-expert interaction, which has been explored extensively in prior works. Although the authors attempt to differentiate their proactive advice-seeking mechanism, the distinction is not entirely clear or compelling. Additionally, scalability issues remain unresolved, with uncertainties about how the framework would perform in more complex environments.

Further, the evaluation of AGILE is primarily limited to two datasets, raising concerns about the generalizability of the method. While the authors conducted additional experiments on another dataset in response to this concern, the lack of comparison with standard PPO and the absence of detailed training curves or standard deviations in the results were noted as weaknesses. The reviewers also highlight the similarity of the AGILE framework to existing works like WebGPT, suggesting that the paper could benefit from a clearer differentiation of its contributions and more rigorous baseline comparisons.

Despite these criticisms, the reviewers recognize the importance of RL studies in the context of LLM agents and appreciate the introduction of the ProductQA dataset (I also recommend the authors check and cite these few works: amazon-science/contextual-product-qa, Product Question Answering in E-Commerce: A Survey, Salespeople vs SalesBot, etc), which they see as a valuable resource. However, they suggest that the paper should more explicitly address its contributions, particularly regarding the conditions under which the agent seeks advice, which could be a more central focus of the work.