Language Grounded Multi-agent Reinforcement Learning with Human-interpretable Communication
We propose a novel computational pipeline to ground MARL communication in human language using embodied LLM agents, enabling interpretable and generalizable communication in ad-hoc multi-agent teamwork.
Abstract
Reviews and Discussion
The paper introduces a framework for communication in multi-agent RL. The communication space is aligned with the embedding space of natural language. The loss for training the underlying RL algorithm is augmented with a cosine similarity loss that aligns the generated communication messages with embeddings produced by LLMs.
Strengths
- The approach is quite flexible in the sense that it can be used with any standard RL algorithm
- The loss function is quite simple but effective.
- The proposed method shows better performance in task-related metrics and cosine similarity/BLEU scores compared to other baselines.
Weaknesses
- Figure 1 could have more details about the architecture in the caption. There is a reference to the figure on line 135 where it states that more information about the architecture can be found in Fig 1. But the caption and the figure were not clear and I had to read more about it in the text.
- The experiments related to the zero-shot generalizability of the model are not convincing enough. I believe there isn't much of a difference between the two evaluated environment variants other than the vision range of the agents. Also, I am not sure if Table 3 just shows 4 instances of coming across different observations/locations in the environment. I am curious to see how this would perform in 10x10 test environments when the agents are trained on 5x5 grid environments. Will the communication messages be able to handle new coordinates like (9,9), (7,8) which were previously unseen in the training scenarios?
- I was not able to understand the exact setting used for the ad-hoc teamwork experiment. In what manner were the LLM agents controlled? How was the communication message decoded from LangGround to natural language? Was there a reason for evaluating the ad-hoc teamwork on just 24 episodes in Table 4?
- The environments in which the model was tested are simple grid-world environments. The usage of LLMs to generate embedding data motivates testing it out in more complex environments (I think this has been pointed out in the future work section.).
Questions
Refer to the weaknesses section for my questions.
Limitations
The experimental evaluation in some of the experiments (Tables 3 and 4; refer to the weaknesses section above) is not extensive. Also, the procedure used to obtain the results in Table 4 is not clear.
We thank the reviewer for appreciating our work, and we seek to clarify our experiment design with the following responses.
Q1: Figure 1 can be better presented
Your suggestion is well received. Thank you for pointing it out. We have remade the framework illustration in Figure D in the PDF and will add it to the final paper.
Q2: Zero-shot generalization experiment
We apologize for the confusion. We seek to clarify the design of the zero-shot generalization experiment to address your concern.
It's worth noting that there are two levels of zero-shot generalization in our work. The first level is communication alignment, where LangGround agents generate human-interpretable communication in task states they have never been grounded on during training. The second level is task completion, where LangGround agents are trained on environment A and are still able to complete the task in unseen environment B.
In Section 6.3.1, we remove a subset of prey spawn locations (i.e., 4 out of 25 cells) during training and evaluate LangGround agents with those unseen states in a zero-shot manner. This experiment requires both task-level and communication-level generalizability because those removed states are 1) unseen during training, and 2) not grounded with human language. Table 3 shows the inclusive list of 4 unseen states and the measurements of communication alignment with unseen ground truth. This is equivalent to your suggested experiment on a smaller scale, and we indeed found that "the communication messages can handle new coordinates, e.g., (1,1), (1,3), (3,1), (3,3), which were previously unseen in the training scenarios".
We acknowledge your suggestion about training on a 5x5 gridworld and testing on a 10x10 gridworld. However, such task-level generalization depends more on the backbone RL model structure and specific environment design, which is beyond the scope of this work. Therefore, we present an alternative experiment focusing on communication-level generalization. We train LangGround agents on a 10x10 gridworld but only provide language grounding in a subset of states. The learning curves of LangGround agents with different levels of language grounding are presented in Fig. B and Fig. C in the PDF file. As shown in the figures, the more states are grounded, the better the team performance and the better the communication alignment. In addition, Table A in the PDF shows the results of communication alignment measurements in un-grounded states. Similarly, the more states are grounded, the better the alignment we observe in communication-level zero-shot generalization.
To summarize, the communication-level generalizability of LangGround stems from the assumption that we could ground the agent communication space with the word embedding space of human language on a limited number of instances, and expect LangGround to output interpretable messages in un-grounded states via topographic similarity of the aligned space. In practice, this assumption depends on many factors such as the number of grounded states, dimension of the communication space, scale of the problem, etc. The above experiment illustrates the impact of language grounding percentage (25%, 50%, 75%, 100%), and we leave further investigation to future work.
Q3: Ad-hoc teamwork experiment
We apologize for the confusion. Brief descriptions about ad-hoc teamwork experiments are presented in Section 6.3.2 and Appendix A.3.2 of the original paper. We will add more details to the revised paper.
Ad-hoc teamwork refers to situations where agents collaborate with unseen teammates without pre-coordination. In this work, we use embodied LLM agents to emulate human behaviors in human-agent ad-hoc teams. We match 2 MARL agents with 1 unseen LLM agent in a team and ask them to complete the collaborative task in predator-prey and USAR environments. The LLM agent is powered by GPT-4-turbo and prompted to output action selection and communication messages given observation inputs, similar to the data collection process introduced in Section 4.1. The Gym environment is wrapped with a text interface to decode observations into English descriptions and encode the LLM agent's output into concrete action selection. Both MARL agents and LLM agents interact with the same task environment in sequence. Natural language communication messages from LLMs are embedded using OpenAI's word embedding API and sent to MARL agents. The communication vectors from MARL agents are translated to English sentences via cosine similarity matching in the dataset D. Their team performance is measured by the number of steps spent in completing the task.
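For concreteness, the sketch below illustrates this two-way text interface at the message level. It is a minimal illustration, not our actual implementation: the sentences, the embedding dimension, and the `embed` function are placeholders.

```python
import numpy as np

# Placeholder lookup table standing in for the offline dataset D: one English
# sentence per row of `msg_embeddings` (real embeddings would come from a text
# embedding model such as the one used in the paper).
dataset_sentences = ["I found the prey at (2, 3).", "No prey in sight, moving north."]
msg_embeddings = np.random.randn(len(dataset_sentences), 256)

def cosine_sims(query, matrix):
    """Cosine similarity between a query vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

def comm_vector_to_sentence(comm_vec):
    """Translate a MARL agent's communication vector into the closest English
    sentence in the dataset via cosine-similarity matching."""
    return dataset_sentences[int(np.argmax(cosine_sims(comm_vec, msg_embeddings)))]

def llm_message_to_comm_vector(text, embed):
    """Embed an LLM teammate's natural-language message so that it can be fed
    to the MARL agents; `embed` is an assumed text-embedding function."""
    return np.asarray(embed(text))
```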
The main reason for only evaluating each ad-hoc team composition on 24 episodes is the cost of calling OpenAI's API for LLM agents. In our case, each agent consumes 2k - 5k tokens per round, and each episode takes 20 - 50 rounds to complete. Evaluating 5 team compositions on 3 environments over 24 episodes costs around $500. We plan to conduct more comprehensive evaluations in the future with different backbone models and human participants.
Q4: Realistic environments
Your suggestion is well received. As for future work, we plan to extend LangGround into more realistic task environments such as ALFWorld, RoCoBench, and ThreeDWorld Transport Challenge, where agents must physically interact with real-world objects based on visual and text inputs. Our proposed pipeline is agnostic to environments and backbone LLM and RL models, therefore it can be easily generalized to those scenarios.
We sincerely appreciate the reviewer's time and effort in helping improve our paper. We hope that our responses have adequately addressed your concerns and clarified the contributions of our work. Thank you for your valuable feedback and consideration.
I appreciate the effort taken by the authors to clarify some of the questions I had. I am happy with the clarifications given and would like to increase my score from 6 to 7.
Thank you for your appreciation. We appreciate your time and efforts in reviewing our paper. We will incorporate your suggestions into the final paper.
This paper proposes a pipeline for aligning multi-agent communication with natural language by using an LLM to collect a synthetic dataset of natural language interactions. Within a MARL communication pipeline, this dataset is used to align the learned communication space with word embeddings of reference communications. This pipeline enables human interpretable messages and ad-hoc collaboration across multiple communication-intensive MARL environments.
优点
- The research direction proposed in this work is important since it connects the literature of emergent communication with natural language.
- The paper is well written and easy to follow.
- The proposed method is easily generalizable to new settings and is usable with different communication backbones and LLMs.
缺点
- This work is missing a discussion comparing the proposed technique to the prior work of CICERO[1] from 2022. CICERO used a dataset of human messages and actions in real games to build a task-specific language model and trained a policy network that communicates by sharing embeddings that are translated through the LLM. This work has many components in common with CICERO and needs to discuss its novelty (beyond replacing human data with GPT-4 data).
- The supervised loss seems to imply that the communication protocol is just a distillation of GPT-4. Although the RL objective could theoretically help determine what information is useful to share, it seems more likely that RL would just learn to optimize actions while leaving the communication protocol to the supervised loss. Even if the RL objective does influence the communication protocol, the only alternative messages it could generate are messages corresponding to different (observation, action) pairs, so I doubt the claim in line 175 that RL determines useful information to share with other agents.
- That being said, I would agree with the notion that RL learns to "interpret" messages from other agents.
- Section 6.3.1 seems misleading to me. Specifically, the way that communication vectors are translated to English sentences is via cosine similarity matching to the most similar message in the dataset (according to lines 653-654 in the appendix). Therefore, the claim that communication messages are similar to natural language sentences in unseen locations (lines 285 to 287) must be true by construction.
[1] FAIR et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022. doi: 10.1126/science.ade9097.
问题
- The BLEU score in line 267 doesn't make sense to me. It is comparing the score between the natural language messages and reference messages in D with the most similar word embedding, but in the appendix (lines 653-654) it states the communication vectors are translated to English sentences via cosine similarity matching. In that sense, shouldn't the BLEU scores be perfect?
- How does GPT-4 compare against your technique (and baseline emergent MARL communication) in the studied settings?
局限性
The limitations mentioned in the discussion are solid. I would additionally include how this technique depends on the notion that the grounding LLM "knows" what message is optimal.
We thank the reviewer for appreciating the motivation and generalizability of our work, and we seek to clarify the framing of our research.
Q1: Missing comparison with CICERO
Thank you for pointing out this paper. We indeed came across it but decided not to include it in the comparison because the motivation and technical approach of CICERO are inherently different from our work. CICERO has an RL module for strategic reasoning and a language model for generating messages. The two modules are trained separately on different datasets, namely self-play trajectories and conversation data collected from human players. During inference, the strategic reasoning model outputs actions and intent estimations conditioned on game state and dialogue history, while the language model generates messages based on game state, dialogue history, and intent estimation. It is clear that the two modules in CICERO function independently, with the only connection being that the language model takes intention estimation from the planning module as input.
However, in our work, both action and communication are generated by individual RL agents that are trained end-to-end with a combination of RL loss and supervised learning loss. As shown in Fig. 1, each LangGround agent is controlled by an LSTM policy, and its hidden state is used to generate a vector for inter-agent communication. The gated average communication vector is taken by each agent's policy to output the action. We do not use an LLM to generate communication in our pipeline. Instead, a dataset of example messages is used to calculate the supervised learning loss.
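To make the training signal concrete, here is a minimal, hedged sketch of such a combined objective together with IC3Net-style gated averaging. The tensor shapes, the cosine-distance form of the alignment term, and the name `sl_weight` are illustrative choices, not our exact implementation of Equation (4).

```python
import torch
import torch.nn.functional as F

def combined_loss(comm_vectors, target_embeddings, rl_loss, sl_weight=0.5):
    """Illustrative objective: the usual RL loss plus a supervised alignment term
    pulling each agent's communication vector toward the word embedding of the
    reference message for its (observation, action) pair."""
    alignment = 1.0 - F.cosine_similarity(comm_vectors, target_embeddings, dim=-1)
    return rl_loss + sl_weight * alignment.mean()

def gated_average(comm_vectors, gate_probs):
    """IC3Net-style gating: each agent's vector is scaled by its gate probability
    before being averaged into the message consumed by teammates' policies."""
    return (comm_vectors * gate_probs.unsqueeze(-1)).mean(dim=0)

# Toy usage with placeholder tensors (3 agents, 256-dimensional communication space).
comm = torch.randn(3, 256, requires_grad=True)
targets = torch.randn(3, 256)                  # embeddings of reference LLM messages
loss = combined_loss(comm, targets, rl_loss=torch.tensor(0.5), sl_weight=0.5)
loss.backward()
```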
Although CICERO and LangGround both result in agents capable of communicating with humans in competitive/collaborative tasks, their major contributions are fundamentally different. CICERO focuses on empirically evaluating the proposed hierarchical framework in real-world scenarios, while our work focuses on proving the concept of aligning MARL agents' emergent communication space with the semantic space of human language. We hope our explanations address your concern.
Q2: Does RL determine useful information to share or not?
Please refer to our answer to Q3 in the overall rebuttal.
Q3: Concerns about cosine similarity and BLEU score
We apologize for the confusion and seek to clarify the design of the zero-shot generalization experiment in the context of the previous argument about our model structure.
In Section 6.3.1, we remove a subset of prey spawn locations during training and evaluate LangGround agents with those unseen states in a zero-shot manner. To accurately retrieve unseen English sentences from the dataset, the agent needs to generalize their communication vector in the learned semantic space in a similar way as in the word embedding space.
For example, in a hypothetical scenario, the agent was only trained on prey locations (0,0) and (2,2) but needs to communicate about (1,1) during evaluation. It must generate an interpolated vector that lies between vectors referring to known locations in the high-dimensional communication space. Because this space is aligned with the word-embedding space, the interpolated vector is likely to be close to the corresponding unseen messages about (1,1) in the dataset. In practice, the generalization might not be done via linear interpolation, depending on the properties of the aligned high-dimensional space.
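As a toy numerical illustration of this interpolation argument (and only that; these are not the paper's embeddings), the snippet below builds a deliberately smooth, linear "embedding" of grid locations. Under that smoothness assumption, the midpoint of two grounded locations retrieves the message for the unseen middle location.

```python
import numpy as np

rng = np.random.default_rng(0)
proj = rng.standard_normal((2, 16))   # random linear map: grid coordinate -> toy embedding
bias = rng.standard_normal(16)

def toy_embedding(x, y):
    """Toy stand-in for the embedding of 'prey is at (x, y)'; linear in the coordinates
    by construction. Real sentence embeddings are only approximately this smooth."""
    return bias + np.array([x, y], dtype=float) @ proj

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The agent was grounded on (0,0) and (2,2) and emits a vector 'between' them for (1,1).
interpolated = 0.5 * (toy_embedding(0, 0) + toy_embedding(2, 2))
candidates = {(x, y): toy_embedding(x, y) for x in range(3) for y in range(3)}
best = max(candidates, key=lambda loc: cosine(interpolated, candidates[loc]))
print(best)  # (1, 1): the interpolated vector retrieves the unseen location's message
```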
The above-mentioned process is the prerequisite for achieving high cosine similarity and BLEU scores, as well as retrieving unseen sentences correctly in zero-shot generalization. It is clear that this process is not guaranteed 'by construction', but rather depends on the alignment between the agent communication space and the word embedding space of human language. As mentioned earlier, the LangGround agent is solving a multi-objective optimization problem with a single LSTM model. The agent policy must trade off completing the task and aligning communication, and neither of them can be solved easily. Therefore, we design different criteria to evaluate the agents on task performance and communication alignment, respectively.
We also present additional zero-shot generalization experiment results in the overall rebuttal. Please refer to our answer to Q4 for more information.
Q4: Compare GPT-4 with LangGround
The comparison results are presented in Table 4 in the original paper. The LLM row corresponds to the team performance of 3 embodied agents powered by GPT-4-turbo. As analyzed in Section 6.4.2, LLM-only teams perform worse than MARL-comm teams, and better than ad-hoc teams.
Q5: LangGround relies on LLMs' performance
Please refer to our answer to Q2 in the overall rebuttal.
We sincerely appreciate the reviewer's time and effort in helping improve our paper. We hope that our responses have adequately addressed your concerns and clarified the contributions of our work. Thank you for your valuable feedback and consideration.
Thank you for your detailed rebuttal. I am satisfied with the answer to the question of Cosine similarity + BLEU scores, but I still have multiple lingering concerns.
GPT-4 Comparison
Thank you for pointing me to table 4; during my original reading of the paper, I thought the LLM row was another form of ad-hoc teamwork based on the title. However, I disagree with the analysis that LLM-only teams perform worse than MARL-comm teams given that it is significantly better in USAR and within the margin of error for the other two settings.
Does RL Determine Useful Information
From the overall rebuttal, it still seems like the overall communication protocol is just a distillation of OpenAI's text-embedding-3-large over a dataset of GPT-4 communication demonstrations, but RL helps interpret messages and determine whether to send messages in the first place.
I am surprised by the fact that removing the gating function causes a significant drop in performance, since only cooperative settings are studied in this environment, so I don't know why always communicating would harm performance (in fact the original IC3Net paper states that setting communication probability to 1.0 is often beneficial in purely cooperative settings). This seems to imply that the communications are often distracting and uninformative. This result would need to be analyzed much more in the paper to explain the role of RL in communication.
LangGround relies on LLM performance
Based on the rebuttal, it seems fair to say that LangGround's ad-hoc teamwork abilities are impacted by the performance of the LLM (or the dataset generator more generally). If the LLM cannot effectively explain an observation-action pair, LangGround would inherit this weakness. Furthermore, if the LLM never experiences an observation-action pair, the lookup will force LangGround to copy the message from another observation-action pair in the dataset, limiting generalization in more complex settings where the LLM cannot cover the whole observation space.
CICERO
I generally agree with your summary of CICERO. However, I would still expect a comparison between these two techniques in the related works section. Both have separate modules for taking actions and for producing actual language. In CICERO, there is a dialogue-free RL model that is grounded to language via KL regularization against a dialogue-conditioned action model, which is comparable at a high level to your entire model. The language generation component is an LLM in CICERO that is conditioned on intents, whereas it is a database lookup in yours (which is not a key contribution based on the fact it is only explained in the end of the appendix).
In my opinion, the key difference between CICERO and your work is the fact that the latent communication vector is an additional input to the RL agent instead of serving as an anchor policy. Regardless, CICERO is a key piece of related work that needs to be addressed in the related works section, especially given the last sentence of your introduction.
Generalize to unseen states
We would like to first emphasize the main contribution of our paper: we propose a general pipeline to align agent communication space with the semantic space of human language by grounding them on a small set of instances. With a well-aligned communication space, any method can be used to translate the high-dimensional vectors into natural language messages. The cosine similarity matching method used in the paper is only for the purpose of proof-of-concept. Alternatively, we could learn a sequential model for this translation process for better generalizability.
In the zero-shot experiments, we have demonstrated that LangGround performs reasonably well even when LLMs do not cover all states. We have also experimented with using a similarity measurement to approximately retrieve missing (action, obs) pairs from the dataset. Its performance is close to the current exact retrieval method while potentially generalizing better in complicated scenarios. Experimental results can be provided upon request.
CICERO
We acknowledge the high-level similarity between CICERO and LangGround, and will add the comparison to the related work section and revise the claim in the introduction. However, we would like to clarify the framing of our work to better differentiate CICERO from our approach.
LangGround is an extremely flexible pipeline with most of the modules being interchangeable, such as the word embedding model, base MARL-comm model, message dataset source, and most importantly, the translation module. As discussed previously, we focus on communication alignment and empirically evaluate it in the context of ad-hoc teamwork. The cosine similarity matching method used for translation during evaluation is only for the purpose of proof-of-concept and is subject to change.
With the above argument in mind, we believe the key difference between CICERO and LangGround is in the intermediate representations (i.e., intents in CICERO and comm vectors in LangGround) between the RL policy and translation module. In CICERO, the format of intents is defined by domain experts. An intent model is trained to identify players' intents from conversation history based on human-annotated data. In LangGround, we replace those procedures by using LLMs as the semantic knowledge base. This provides LangGround more flexibility to be applied to different domains and task scenarios. In addition, high-dimensional communication vectors are more descriptive than intents with fixed format, thus could potentially carry more contextual information from RL policy to language models. The topological similarity property of the latent communication space also allows LangGround to generalize to unseen states in a zero-shot manner.
To summarize, we acknowledge the contribution of CICERO in the area of human-interpretable communication. However, we also believe that our pipeline, LangGround, is significantly different from CICERO and provides novel contributions to the community.
Final comments
We appreciate your time and effort in reading our rebuttal and providing feedback. We hope our clarifications address your concerns and kindly request that you reconsider the overall rating.
Thank you for the additional clarifications. I agree with the new response regarding CICERO and I appreciate the clarifications regarding communication sparsity. I will summarize my current thoughts on the paper below.
- I consider prompting the LLM to be a baseline method given that it is used to generate the language dataset to begin with. Based on the results of this paper, it seems like using the LLM directly would still be better for solving these tasks. When I gave my original score, I thought there would be an argument that RL could enable significantly stronger actions than the baseline LLM (despite distilling the key communication protocol), but after reviewing Table 4 I saw that this wasn't the case in the studied environments.
- The gating experiments in the rebuttal are interesting but underexplored. LangGround's communication probability is ~80%, but removing the gate entirely causes an extremely large reduction in performance whereas IC3Net seems to be barely impacted. This result is very surprising and could lead to evidence that simply replacing the communication protocol with the LLM/dataset would not perform as well as LangGround.
- My key takeaway from this paper is that existing general-purpose LLMs and embedding models can be used to create a communication protocol on-par with SOTA tabula-rasa emergent communication techniques in the studied environments. The generalization results imply that interpolation does occur, though there is a significant drop in performance when only 1/4 of the states are grounded (and similarity metrics are unimpressive when 1/2 of the states or less are grounded), raising doubts about the scalability of this technique for more complex settings.
Compare LangGround with LLM embodied agents
Although comparing LLM embodied agents with MARL agents (e.g., LangGround) is not the main focus of this paper, we provide our analysis in three parts:
- Many studies evaluate LLMs' planning capability for embodied agents. It is widely acknowledged that embodied LLM agents may generate infeasible action plans in interactive tasks due to hallucinations and bias from training data [1]. Consequently, MARL agents are generally more efficient in completing tasks than pre-trained LLM agents [2]. Specifically in our case, LLM and LangGround perform similarly because 1) LangGround is not 100% optimized for task reward, 2) LangGround does not use SOTA MARL models as the backbone (e.g., MAPPO), and 3) LLM embodied agents are heavily prompt-engineered for specific environments.
- LLMs have a large number of parameters and thus take longer to run. In our case, calling OpenAI's API is ~50 times slower than running local inference of MARL models and costs ~5 dollars per episode. Taking the massive time and resources required to train LLMs into consideration, MARL models are more affordable and eco-friendly for solving specific tasks.
- The effort required to generalize LangGround to different scenarios is smaller than for embodied LLM methods. Because LangGround only uses effective communication messages from LLMs, only a small amount of prompt engineering is required compared to alternative methods that require LLMs to generate actionable plans at each step.
Gating experiment
We ran this ablation study to answer your original question about attributing RL to communication. We came to similar conclusions that MARL indeed learns to selectively communicate with the help of language grounding, i.e., mediator effect. We also find the results interesting and plan to further explore them in future work. It would be helpful for us to improve the work if you could specify the potential research directions you have in mind.
Generalization
We would like to clarify the experimental settings to help you better evaluate the zero-shot generalization results. The dataset we used in the evaluation is constructed from limited trajectories of LLM agents and thus usually does not cover all (obs, action) combinations. For example, the full dataset D contains 1,893 data points, which cover only 45 out of 50 possible observations. The additional zero-shot generalization experiments were conducted on a 10 by 10 map with an even smaller dataset of ~1,000 data points, due to time and resource limitations. This means the 100% condition in the attached PDF represents roughly 12.5% (the map size is 4 times bigger, and the dataset is half the size) of all possible (o, a) combinations. With this being said, the generalization performance we observe in Table A corresponds to roughly 12.5% to 3% of states being grounded. We will revise the description to make the experimental setting clearer. We hope this helps you calibrate your evaluation of the scalability of LangGround in more complex scenarios.
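For readers who want to check the arithmetic, here is a back-of-the-envelope reconstruction of this coverage estimate; the assumption that the number of covered observations scales proportionally with dataset size is ours, not a measured quantity.

```python
# Coverage on the 5x5 map: 45 of 50 possible observations are grounded.
possible_obs_5x5, covered_5x5 = 50, 45

# 10x10 map: roughly 4x as many possible observations, with a dataset about half the size.
possible_obs_10x10 = possible_obs_5x5 * 4
covered_10x10 = covered_5x5 * (1000 / 1893)   # assume coverage scales with dataset size

coverage_full_grounding = covered_10x10 / possible_obs_10x10
print(f"{coverage_full_grounding:.1%}")  # ~12%, so the 25% condition grounds roughly 3% of states
```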
[1] LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks. arXiv 2024
[2] Theory of Mind for Multi-Agent Collaboration via Large Language Models. EMNLP 2023
Thank you for the clarification regarding the latest generalization results! I see that this result is much stronger than I initially realized, demonstrating that the communication protocol has genuine non-trivial interpolating capabilities. I see this as a valuable result and will increase my score accordingly.
Regarding the gating experiments for future work, I would've liked to see if the "communication" network could've been replaced entirely by the lookup procedure (or an approximate lookup if not exactly available) since this is treated as the ground truth in supervised learning. Furthermore, I would've liked qualitative observations regarding the discarded messages since these seem to have a large impact on task performance if left in.
Thanks for your timely reply. We appreciate your acknowledgment of the newly added experimental results and the suggestions for future directions. We will revise the paper accordingly.
We are sorry to see you have changed your overall rating from '5 = borderline accept' to '3 = reject' after reading our rebuttal and additional experiment results. We seek to provide additional clarifications for your remaining concerns.
GPT-4 comparison
We acknowledge your evaluation of the results in Table 4 that the LLM-only team performs on par with MARL-only teams. However, as pointed out in Section 6.3.2, the comparison among homogeneous teams is not the focus of the ad-hoc teamwork experiments. Instead, we try to quantify the performance loss when introducing an unseen teammate into the team, and the degree of human-agent communication achieved via LangGround compared to random baselines, by comparing different ad-hoc teams (e.g., LLM + aeComm, LLM + LangGround). Therefore, whether pure LangGround teams perform better than pure GPT-4 teams does not impact the conclusion we draw in the paper. We do not see this as a weakness of our work by any means.
Communication protocol is a distillation of example messages
We acknowledge your hypothesis about "distillation," as it is a possible analogy of our proposed pipeline. However, as mentioned in the overall rebuttal, it is impossible to fully decouple the contribution of RL and SL with our current model structure. We have no way to verify the hypothesis about "interpretation" and can only attribute both communication and action to RL and SL as a multi-objective optimization problem.
More importantly, whether the hypothesis is true or false does not diminish the value of our work. Many previous works in multi-agent communication learn a latent representation of observation and directly use the encoding as a communication vector [1, 2]. Our method works similarly, but we use word embeddings of observation-action descriptions as the latent representation. Whether RL determines information to share or "interprets" messages are merely different technical approaches, rather than critical concerns that limit our pipeline.
Communication sparsity
We seek to clarify the misunderstanding you might have regarding the experiment settings and the value of sparse communication.
- We used the semi-cooperative setting for predator-prey (i.e., mixed mode) where agents receive individual rewards. The original statement in IC3Net you cited only applies to fully-cooperative settings where agents receive a global shared reward. In fact, the communication probability learned by LangGround agents is around 80%, which aligns with the results reported in the IC3Net paper.
- The value of sparse communication is well acknowledged by the research community. For example, IC3Net outperforms CommNet in mixed predator-prey by introducing the gating function to allow agents to learn to communicate when necessary. Previous research proves that a sparse communication protocol can be learned with little to no performance loss [3]. The statement that 'learned communications are often uninformative' is exactly the motivation for exploring sparse communication. We see this as the nature of agent and human communication, rather than a weakness of our work. We leave further exploration to future work because communication sparsity is out of the scope of this paper.
LangGround relies on LLMs
We acknowledge your statement that LangGround's communication depends on the quality of the example dataset. However, the collection of this dataset is fairly accessible. LLMs are known to have good linguistic capabilities (e.g., describing) while struggling with formal reasoning (e.g., planning). Our pipeline actually allocates the appropriate task to LLMs, i.e., describing the obs-action pairs, and leaves the planning part to RL. In more complicated scenarios in which communication goes beyond describing observations and actions, we could still expect LLMs to generate reasonable outputs since they were trained on massive conversation and dialogue data. Compared to alternative methods for embodied agents where LLMs must make correct action plans at every timestep, it is more feasible to collect semantically meaningful messages from either LLMs or any other sources.
To be continued.
[1] Learning to ground multi-agent communication with autoencoders. NeurIPS 2022
[2] Learning to Communicate using Contrastive Learning. ICLR 2024
[3] Towards True Lossless Sparse Communication in Multi-Agent Systems. ICRA 2023
The paper presents a novel computational pipeline aimed at aligning the communication space of Multi-Agent Reinforcement Learning (MARL) agents with an embedding space of human natural language. The authors propose grounding agent communications on synthetic data generated by embodied Large Language Models (LLMs) in interactive teamwork scenarios.
Strengths
- The use of synthetic data generated by LLMs to align MARL agents' communication is a creative application of existing technologies in a new way, demonstrating originality in methodology.
- While the paper does not present new theoretical results, it does provide a solid empirical foundation for its approach, which is well-supported by experiments.
- The proposed computational pipeline appears technically sound, with a clear explanation of how it aligns with human language and the rationale behind the design choices.
- The authors provide clear explanations of complex concepts, such as the alignment of communication spaces and the grounding process, making the paper accessible to a broader audience.
Weaknesses
- The paper does not provide a theoretical framework or proofs to support the empirical findings. Developing a theoretical basis could strengthen the claims and provide deeper insights into why the approach works.
- The experiments are conducted in controlled environments. To strengthen the claims, testing the approach in more diverse and complex scenarios could provide evidence of broader applicability.
- The paper relies heavily on synthetic data generated by LLMs. There might be concerns about the representativeness of this data for real-world scenarios.
- Some details regarding the implementation of the MARL agents and the interaction with LLMs could be better elaborated.
Questions
See weakness
Limitations
See weakness
We thank the reviewer for acknowledging the originality and soundness of LangGround, our proposed method for MARL agents with human-interpretable communication. We seek to clarify your questions with the following responses and additional experimental results.
Q1: No theoretical proof
We do not provide a proof for our work because the theories we used in deriving the proposed method come from previous literature in emergent communication and MARL-comm. However, we do provide an analysis in the discussion section to explain why our method works in practice.
The core idea of LangGround is providing an auxiliary loss to directly regulate intermediate hidden layers of the RL agent's policy network. This technique has been proven effective in stabilizing the learning process via additional supervision [1]. In MARL-comm, previous work has used representation learning methods (e.g., autoencoder [2], contrastive learning [3]) to construct communication messages solely based on encoding observations. Our pipeline combines these two techniques by using language priors to regulate RL agents' communication via supervised learning. Our dataset consists of expert trajectories from LLM embodied agents with a well-established grounding on the task. We believe our method helps agents learn a semantic representation of task-related information grounded on word embeddings of example messages.
The other relevant theory is the Information Bottleneck Theory. In emergent communication, a speaker and a listener jointly learn a communication protocol to maximize the task reward. But this may compromise other properties of the learned language such as task-agnostic informativeness and human interpretability [4]. To overcome this issue, we align the agent communication space with the semantic space of human language by grounding them on a small set of data, and add the supervised learning loss to the task loss [5]. This explains the trade-off between utility and informativeness we observe across task environments, representing different solutions on the Pareto front of the multi-objective optimization problem.
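To make the multi-objective reading explicit, the trade-off can be written schematically as follows (the notation here is illustrative; Equation (4) in the paper is the authoritative form):

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{RL}}(\theta)
  + \lambda \, \mathbb{E}_{(o,a,m)\sim\mathcal{D}}
    \bigl[\, 1 - \cos\!\bigl(c_\theta(o),\, e(m)\bigr) \bigr]
```

where c_theta(o) is the agent's communication vector, e(m) is the word embedding of the reference message for the corresponding (observation, action) pair, and sweeping the weight lambda traces solutions along the utility-informativeness Pareto front.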
Q2: Evaluate LangGround in scaled environments
Please refer to our answer to Q1 in the overall rebuttal.
Q3: LangGround relies on LLMs' performance
Please refer to our answer to Q2 in the overall rebuttal.
Q4: Details about MARL implementation and ad-hoc teamwork are missing
We apologize for the confusion. Descriptions about ad-hoc teamwork experiments are presented in Section 6.3.2 and Appendix A.3.2 of the original paper. Implementation details of MARL agents are presented in Section 4.2 and Appendix A.1.2. We provide additional descriptions and a re-made framework illustration in Figure D in the PDF. We will add them to the final paper.
Ad-hoc teamwork refers to situations where agents collaborate with unseen teammates without pre-coordination. In this work, we use embodied LLM agents to emulate human behaviors in human-agent ad-hoc teams. We match 2 MARL agents with 1 unseen LLM agent in a team and ask them to complete the collaborative task in predator-prey and USAR environments. The LLM agent is powered by GPT-4-turbo and prompted to output action selection and communication messages given observation inputs, similar to the data collection process introduced in Section 4.1. The Gym environment is wrapped with a text interface to decode observations into English descriptions and encode the LLM agent's output into concrete action selection. Both MARL agents and LLM agents interact with the same task environment in sequence. Natural language communication messages from LLMs are embedded using OpenAI's word embedding API and sent to MARL agents. The communication vectors from MARL agents are translated to English sentences via cosine similarity matching in the dataset D. Their team performance is measured by the number of steps spent in completing the task.
We sincerely appreciate the reviewer's time and effort in helping improve our paper. We hope that our responses have adequately addressed your concerns and clarified the contributions of our work. Thank you for your valuable feedback and consideration.
[1] Lee, C. Y., Xie, S., Gallagher, P., Zhang, Z., & Tu, Z. (2015, February). Deeply-supervised nets. In Artificial intelligence and statistics (pp. 562-570). Pmlr.
[2] Lin, T., Huh, J., Stauffer, C., Lim, S. N., & Isola, P. (2021). Learning to ground multi-agent communication with autoencoders. Advances in Neural Information Processing Systems, 34, 15230-15242.
[3] Lo, Y. L., & Sengupta, B. (2022). Learning to ground decentralized multi-agent communication with contrastive learning. arXiv preprint arXiv:2203.03344.
[4] Tucker, M., Levy, R., Shah, J. A., & Zaslavsky, N. (2022). Trading off utility, informativeness, and complexity in emergent communication. Advances in neural information processing systems, 35, 22214-22228.
[5] Tucker, M., Li, H., Agrawal, S., Hughes, D., Sycara, K., Lewis, M., & Shah, J. A. (2021). Emergent discrete communication in semantic spaces. Advances in Neural Information Processing Systems, 34, 10574-10586.
The authors have provided a detailed rebuttal. Can Reviewer hS12 share their thoughts?
After reading the rebuttal of the authors and the discussion between the authors and other reviewers, I think my concerns are mainly addressed. I would raise my rating.
We appreciate again your time and effort for reviewing our paper. We are glad that you are satisfied with the rebuttal and appreciate your re-evaluation of our paper based on the clarifications provided during the discussion period.
The paper presents a method of using LLMs with synthetic data to generate human interpretable multi-agent communication protocols using zero-shot learning. The model, called language-grounded multi-agent communication (LangGround) aligns the communication of the multi-agent model with the LLM based communication.
The model is tested on 2 versions of the Predator-Prey environment and 1 version of the USAR environment against multiple baselines, showing that the model presented in the paper performs comparably to or better than state-of-the-art models in multi-agent communication. Moreover, the paper shows that the communication protocol is generalizable and allows for human interpretation and the ability to work in teams with LLM-based agents.
Strengths
The model is novel and experiments show that it outperforms other state-of-the-art baselines.
The related work section is of sufficient length for a paper of this length.
The approach is clear and the figures are easy to understand.
Testing against multiple MARL environments and configurations provides a convincing argument.
The shared supplementary materials allow for easy reproducibility.
Weaknesses
There could be a human subjects experiment to show that the presented model is actually preferred by humans over alternative models.
The experiments could be expanded to show the performance of the model at different scales.
The differences between the model presented in the paper and [24] can be highlighted more. While section 4.1 goes into detail that the Language model used is derived from the previous work, it may be better to stress the primary differences given that both models are tested on the USAR environment.
Questions
How is this model more advantageous than a pure LLM based environment as presented in [24]?
How susceptible is this model to hallucinations from the LLM?
How does the performance of the LLM Model used affect the performance of the trained LangGround Model?
Limitations
Prompt engineering is required to allow for the model to be integrated into new environments.
The model that is trained depends on the performance of the Language Model.
The models are shown to trade off utility and informativeness, causing a decrease in performance for interpretability.
We thank the reviewer for acknowledging the novelty and soundness of LangGround, our proposed method for MARL agents with human-interpretable communication. We appreciate your constructive comments and suggestions and seek to clarify them with the following responses and additional experiment results.
Q1: Human Subjects experiment
We acknowledge the necessity of evaluating our method with real human participants given the ultimate goal of learning human-interpretable communication. However, we decide to leave human subject experiments for future work based on the following considerations:
- We want to focus the scope of this paper on the technical details of the proposed pipeline and the evaluation of the aligned communication space. Human experiment results might be influenced by many factors (e.g., demographics, background, attitudes toward AI) and therefore require more careful consideration.
- Many works have proved the effectiveness of LLMs in simulating human behaviors and replicating data collection from human participants [1]. This is our original motivation for using LLMs to generate synthetic data for LangGround. It is a reasonable first step to conduct a sanity check by evaluating LangGround in ad-hoc teamwork with embodied LLM agents, before directly testing it with human subjects.
- Our method is the only model that learns an aligned communication space with natural language, facilitating direct conversation with humans. Alternative models either do not have any interpretability (e.g. IC3Net, aeComm), or only learn a semantically meaningful space independent from human language (e.g. protoComm, VQ-VIB). In the ad-hoc teamwork experiments, we actually give baseline models an advantage by allowing them to use the offline dataset of LangGround and OpenAI's word embedding model. Otherwise, no alternative models are able to communicate with embodied LLM agents or humans in natural language.
Q2: Evaluate LangGround in scaled environments
Please refer to our answer to Q1 in the overall rebuttal.
Q3: What is the primary difference between LangGround and [2]
Thank you for pointing out potential directions for improving the presentation of our work. We will add more details to explain the method used in previous work [2] and compare it with LangGround.
Essentially, [2] proposes a prompt engineering technique to improve the performance of embodied LLM agents in collaborative tasks by allowing them to keep track of key task-related information. They evaluated this method in the USAR environment and proved its effectiveness. In contrast, in this paper we propose a pipeline to improve the robustness and interpretability of emergent communication learned by Multi-Agent Reinforcement Learning agents. Therefore, the motivations and technical approaches of [2] and LangGround are inherently different, although both result in artificial agents capable of collaborating and communicating with humans.
Q4: Compare MARL-comm agents with embodied LLM agents in general
Regarding this question, our argument is two-fold:
- MARL and LLM are trained with different learning objectives: MARL is optimized for maximizing the expected reward from environment, while LLMs are optimized to predict the next word given the context. Embodied LLM agents may generate infeasible action plans in interactive tasks due to hallucinations and bias from training data. Consequently, MARL agents are generally more efficient in completing tasks than pre-trained LLM agents [2].
- LLMs have a large number of parameters and thus take longer to run. In our case, calling OpenAI's API is ~50 times slower than running local inference of MARL models, and costs ~5 dollars for each episode. Taking the massive time and resources required to train LLMs into consideration, MARL models are more affordable and eco-friendly in solving specific tasks.
Q5: Impact of LLM's hallucination and performance on LangGround
Please refer to our answer to Q2 in the overall rebuttal.
Q6: Prompt engineering is required for generalization to new environments
In practice, yes. But the amount of effort required for generalization is much smaller compared to alternative methods in either the embodied LLM agent or MARL-comm literature. The rationale is similar to the argument we made for Q5: because LangGround only uses effective communication messages from LLMs, only a small amount of prompt engineering is required compared to alternative methods that require LLMs to generate actionable plans for each step. Learning a communication protocol from scratch is known to be challenging in the MARL community. Our method stabilizes the learning process and reduces the amount of engineering work (e.g., hyper-parameter searching) when generalizing to new domains, by introducing language priors from a general knowledge base, LLMs.
We sincerely appreciate the reviewer's time and effort in helping improve our paper. We hope that our responses have adequately addressed your concerns and clarified the contributions of our work. Thank you for your valuable feedback and consideration.
[1] Using large language models to simulate multiple humans and replicate human subject studies. ICML 2023
[2] Theory of Mind for Multi-Agent Collaboration via Large Language Models. EMNLP 2023
The authors have provided a detailed rebuttal. Can Reviewer hS12 share their thoughts?
We thank all the reviewers for their time and effort in helping us improve the paper. We appreciate your acknowledgment of the novelty and soundness of LangGround, our proposed method for MARL agents with human-interpretable communication. In this rebuttal, we seek to clarify a few common questions and provide additional experiment results upon request.
Q1: Evaluate LangGround in scaled environments
Upon your request, we ran additional experiments in the Predator Prey environment with a larger map size (10 by 10). The learning curves of LangGround and baselines are presented in Fig. A in the rebuttal PDF file. As shown in the figure, our method outperforms ablated baselines without language grounding (i.e., IC3Net) or without communication (i.e., noComm). This result demonstrates the benefit of LangGround in stabilizing the learning of emergent communication for MARL agents in scaled environments.
As for future work, we plan to extend LangGround into more realistic task environments such as ALFWorld, RoCoBench, and ThreeDWorld Transport Challenge, where agents must physically interact with real-world objects based on visual and text inputs. Our proposed pipeline is agnostic to environments, LLMs, and MARL-comm model structure; therefore, it can be easily generalized to those scenarios.
Q2: LangGround relies on LLMs' performance
Essentially, LLM's hallucination and task performance do not directly affect LangGround's performance. We use embodied LLM agents to collect grounding communication datasets for LangGround as explained in Section 4.1. These datasets can also come from rule-based agents or human participants, as long as they show effective communication in solving collaborative tasks. Since only communication messages are used, LLM agents' task performance does not impact MARL's performance. For example, based on our observations during experiments in the predator-prey environment, due to hallucination LLM agents may incorrectly interpret the task as encircling the prey instead of reaching it, stopping one step away and saying, "I found the prey at (x, y), ready to encircle it." Despite not completing the task, their communication remains valid in reporting the prey location. During training, MARL agents optimize their policy for reaching the prey and aligning communication messages, so LLM agents' sub-optimal behaviors are not learned. The only minor issue is that MARL might learn to use the word 'encircle,' which could cause slight issues in ad-hoc teamwork with humans.
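To illustrate the point that only the communication content is reused, a hypothetical sketch of the dataset-collection step follows; the field names and episode format are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class GroundingExample:
    """One entry of the grounding dataset: what the LLM agent saw, what it did,
    and what it said, plus the word embedding of the message."""
    observation: str
    action: str
    message: str
    embedding: List[float] = field(default_factory=list)

def build_grounding_dataset(llm_episodes, embed: Callable[[str], List[float]]):
    """Collect (obs, action, message) triples from LLM agent rollouts. Episodes are
    kept even when the LLM behaved sub-optimally for the task: only the messages
    (e.g., 'I found the prey at (x, y)') are reused for grounding."""
    dataset = []
    for episode in llm_episodes:
        for step in episode:                 # step: dict with 'obs', 'action', 'message'
            if step.get("message"):          # keep only steps where the agent spoke
                dataset.append(GroundingExample(
                    observation=step["obs"],
                    action=step["action"],
                    message=step["message"],
                    embedding=embed(step["message"]),
                ))
    return dataset
```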
Q3: Does RL determine useful information to share or not?
This is a very good question targeting the core mechanism of LangGround: how the reinforcement learning signal and the supervised learning signal are attributed to the agent's action and communication outputs in the multi-objective optimization problem. Thank you for bringing this up.
Because the LangGround agent is trained end-to-end with a combination of RL and SL loss and uses the intermediate hidden state of its policy as the communication vector, it is very hard to separate the reasoning processes of action and communication. However, we could provide indirect evidence to prove that RL and SL jointly contribute to both the agent's action and communication.
- The MARL-comm agent uses a gating function to learn whether to communicate at specific timestamps. We could ablate this function to see its impact on team performance. As shown in the following table and Fig. E in the PDF, removing the gating function harms the performance of LangGround more than IC3Net. This means both RL and SL signals influence the content and timing of LangGround communication.
| Model | Performance loss after removing the gating function |
|---|---|
| IC3Net | -9.05 |
| LangGround | -34.09 |
- In addition, we could manipulate the weight of the supervised learning loss (i.e., the weighting coefficient in Equation (4)) to illustrate the contribution of RL and SL signals. As shown in Fig. F and G in the PDF, this weight matters for both task performance and supervised loss. If the SL loss is weighted too high, the LangGround agent cannot optimize its policy for completing the task, while if the RL loss is weighted too high, the LangGround agent cannot align its communication with human language. This result aligns with our claim that RL optimizes the communication for task utility and SL optimizes the communication for alignment.
Q4: Zero-shot generalization experiment
We present additional experiments to verify the zero-shot generalization capability of LangGround. Specifically, we train LangGround agents on 10x10 Predator Prey but only provide language grounding in a subset of states (i.e., 25%, 50%, 75%, 100%). The learning curves of LangGround agents with different levels of language grounding are presented in Fig. B and Fig. C in the PDF file. As shown in the figures, the more states are grounded, the better the team performance and the better the communication alignment. In addition, Table A in the PDF shows the results of communication alignment in un-grounded states during evaluation. Similarly, the more states are grounded, the better the alignment we observe in communication-level zero-shot generalization.
To summarize, the generalizability of LangGround stems from the assumption that we could ground the agent communication space with the word embedding space of human language on a limited number of instances, and expect LangGround to output interpretable messages in un-grounded states via topographic similarity of the aligned space. In practice, this assumption depends on many factors such as the number of grounded states, dimension of the communication space, scale of the problem, etc. The above experiment illustrates the impact of language grounding percentage, and we leave further investigation to future work.
This paper studies the problem of Multi-Agent Reinforcement Learning in the context of embodied Large Language Models (LLMs). The paper investigates the communication patterns of LLMs vis-a-vis human natural language, finding positive results for their technical approach of distilling and aligning grounded agent communication within an embedding space of human natural language.
Overall, the reviewers were mostly positive with only one borderline review. Multiple reviewers were positively impressed with the rebuttal and raised their scores. Concerns about the paper's weaknesses include the need for the LLM agents to be able to generate good messages in order to create a good dataset; further, it may be impossible to generate "novel" messages with the current approach. Nonetheless, the reviewers agreed that the paper's greatest strength was its novelty: it can make a contribution towards MARL and folks modeling natural language.